
Int. j. inf. tecnol.

https://doi.org/10.1007/s41870-018-0130-3

ORIGINAL RESEARCH

Spectrum management in cognitive radio ad-hoc network using Q-learning

Shiraz Khurana1 • Shuchita Upadhayaya1

Received: 25 November 2017 / Accepted: 28 March 2018


© Bharati Vidyapeeth's Institute of Computer Applications and Management 2018

Abstract Cognitive radio ad-hoc networks (CRAHNs) are the new buzzword wherever spectrum congestion is a matter of discussion. They were proposed to handle spectrum insufficiency by shifting traffic to portions of the spectrum that are not in use at the time, without affecting the transmissions of legitimate users. In this paper we first discuss CRAHNs, then reinforcement learning, and then the challenges that must be handled to implement reinforcement learning in CRAHNs. Reinforcement learning can be applied in many areas of cognitive radio ad hoc networks, such as spectrum sensing, spectrum management, spectrum mobility, and spectrum sharing. Our primary focus in this paper is on spectrum management, and we propose a reinforcement learning technique to handle it. Our approach uses the Q-learning algorithm (implemented in Python) to show that if reinforcement learning is used in this area, there is a drastic improvement in the performance of CRAHNs, and it enables efficient assignment of spectrum by maximizing the long-term reward. We then execute our algorithm against various network scenarios. Finally, we report our experimental results, which show that the Q-learning model takes some episodes to learn, after which it gives highly optimized results that keep secondary users from interrupting primary users with extremely good accuracy.

Keywords CRAHN · Reinforcement learning · Machine learning · Mobile agent

Corresponding author: Shuchita Upadhayaya, shuchita_bhasin@yahoo.com
Shiraz Khurana, shiraz.khurana@gmail.com
1 Department of Computer Science & Applications, Kurukshetra University, Kurukshetra, Haryana, India

1 Introduction

Cognitive radio ad hoc networks emerged to solve problems associated with wireless networking caused by the limited available spectrum. The problem of spectrum inadequacy can be removed by exploiting the existing wireless spectrum intelligently. Earlier, spectrum was allocated using static allocation policies, which partitioned the radio spectrum into two parts: licensed and unlicensed bands. Licensed bands are assigned to licensed users, which do not transmit at all times, resulting in wasted spectral resources; examples of licensed bands are mobile carriers, television broadcasting, etc. Unlicensed users operate in unlicensed bands, which support a variety of applications [e.g., sensor networks, personal area networks (PAN), mesh networks, wireless local area networks (WLANs)]; the increased number of users has caused congestion in this category of band [1].

The concept of dynamic spectrum access (DSA) was proposed to address these spectrum insufficiency issues. Software defined radio (SDR) allows the development of devices that can be adjusted to function across a wide spectrum band; such devices are called cognitive radio (CR) devices [2]. A cognitive radio is able to alter its transmitting parameters (operating spectrum, modulation, transmission power, and communication technology) by interacting with its operating environment. Thus CR technology can be utilized to address the issue of uneven spectrum usage and inefficiency by identifying and transmitting through the seldom used portions of the spectrum.


A network constituted of such radio devices is known as a cognitive radio network (CRN). CRNs without any fixed infrastructure support are called distributed networks, in which nodes communicate with each other in an ad-hoc manner [3]; such a network is called a cognitive radio ad hoc network (CRAHN), as shown in Fig. 1. CRAHNs aim at utilizing scarcely used spectrum bands to increase overall spectrum efficiency.

Fig. 1 Cognitive radio ad hoc network

In licensed band 1, the PU is not using the spectrum, so it is being used by a cognitive user; in licensed band 2, the spectrum (channel) is being used by primary users, so it is not free. Therefore, every time before using the spectrum, the CU senses whether it is being used by a PU or not.

Reinforcement learning (RL) is a type of machine learning in which we allow a machine to learn and change its behaviour based on previous input/output and the feedback received from the environment. This behaviour can be learnt once and for all, or it can keep adapting as time goes on. The history of RL [4] is inspired by two different aspects: one is the psychology of animal learning, and the other is the problem of optimal control and its solution using value functions and dynamic programming. The Q-learning algorithm that we use in this paper is based on reinforcement learning. The rest of the paper is organized as follows: in Sect. 1.1 we discuss an agent's role in reinforcement learning. In Sect. 1.2 we review the literature and discuss the challenges in implementing reinforcement learning in CRAHNs. In Sect. 2 we define the whole workflow of implementing the Q-learning algorithm in the CRAHN environment. Section 3 presents the experimental results, and we conclude the paper in Sect. 4.

1.1 Role of agent in reinforcement learning

An agent is one that can view or perceive its environment through its sensors (i.e., collecting percepts) and can act upon that environment through its actuators. In the case of CRAHNs, the performance, environment, sensors and actuators (PESA) for the agent are given in Table 1. This is just an abstracted view; there can be other parameters as well. The next section discusses the challenges which are to be handled while implementing reinforcement learning in CRAHNs.

1.2 Reinforcement learning with CRAHN (challenges)

CRAHNs support two kinds of users: cognitive users (CU) and primary users (PU). A PU is a user which has legitimate access to the spectrum, while a CU, also known as a secondary user (SU), can opportunistically access the spectrum when it is not being used by the PU [5]. The CU also makes sure that the licensed PU immediately gets the spectrum back for its transmissions whenever it needs it. So the SU should always look for vacant spectrum slots, also known as spectrum holes. When a primary user is not using its allocated spectrum, the spectrum can be utilized by a secondary user. In case of a sudden appearance of the primary user, there are two possible choices for the secondary user to continue its operation: (1) the secondary user jumps to another frequency (another spectrum hole); (2) it keeps operating in the same spectrum but reduces its transmission power so that there is no disturbance to the primary user. CRAHNs are multi-hop wireless networks, in which reinforcement learning techniques can be introduced to boost their overall performance. There are numerous challenges which should be kept in mind before implementing any of the reinforcement learning techniques in CRAHNs.

1.2.1 Challenges [6]

• Multifunctional: An RL agent can be used to observe many different functionalities related to transmission, e.g., availability of radio resources, channel heterogeneity, cost of each transmission, power requirement, delay (back-off and switching) and mobility. The main thing to note here is that all these parameters are to be stored in a state, receive an aggregate feedback, and optimize a general goal as a whole, e.g. throughput.
• Ambidextrous: RL techniques not only exploit the fittest solution but also explore the solution space, adapting their behaviour to the dynamic spectrum. Since the path may change due to many factors, exploration also has to be done.
• Multi-agent: An RL algorithm is not intended just to implement single-agent functionality on a single node; it can also work in a distributed environment so that the total workload is reduced.


Table 1 PESA description for agent

Performance: Latency, throughput, response time, jitter, error rate, network utilization, arrival rate, bandwidth, packet loss, reliability, security and power consumption
Environment: Mobile nodes working in cooperative manner
Sensors: Switches, servers, routers
Actuators: Switches, servers, routers

• Instability: In RL techniques, after observing the current percept from the environment, the cognitive radio (CR) node needs to perform an action after learning from the environment, so that feedback about the total cost of each state-action pair is collected. Sometimes it may choose a suboptimal solution in order to explore a part of the state space that has not been exploited yet; this can be seen as avoiding getting stuck in a local maximum. This activity may lead to short-term performance degradation, and so it can make the network unstable.
• Convergence rate: In RL techniques, the chance of getting a better solution depends highly on the number of percepts fed into the training data set. The more training data there is, the better the chances of obtaining an optimal solution; the amount of training data required might even approach infinity. In wireless networks this is not realistic, so the convergence rate is slow for CRAHN protocols. However, this is not a fundamental property for CRAHNs, due to the non-stationary characteristics of the environment.
• Impact of learning cost on power: The increased cost of the learning process may impact power-constrained devices. It also has higher memory requirements.

2 Workflow for reinforcement learning

In RL, an agent interacts with the environment at discrete points of time. At each step, the RL agent observes the environment, selects an action from the set of possible actions, and receives a reward or punishment from the environment for that specific action. The agent thus receives feedback about the appropriateness of its response [7].

Fig. 2 Agent life cycle

Figure 2 shows the agent's interaction with the environment: if the agent takes some action, the environment changes its state from St to St+1, and in the same way a reward is given by the environment. Figure 3 shows the relationship between the agent's action, state and reward. For instance, the agent is at state St, takes action At, gets a reward Rt+1 and moves to state St+1.

Fig. 3 Relationship between agent state & environment

An RL agent's life is a Markov decision process, so let us make some assumptions about our agent. The agent interacts with its environment in discrete time steps, and its objective is to collect the maximum possible reward [8].

• S is the set of distinct states the agent can perceive from the environment.
• A is the set of distinct actions which the agent can perform.
• At each discrete time step t, the agent senses the state s, chooses an action a, and executes it.
• The environment responds with a reward or punishment r(s, a) and a new state s' = T(s, a). We denote the reward as a number, which is larger for favourable outcomes and smaller for unfavourable ones: R: S × A → R.
• The reinforcement (reward) and next-state functions are not known to the agent; they depend only on the current state and action.
• The reward specifies what the agent needs to achieve, not how to achieve it.
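
To make these assumptions concrete, here is a minimal Python sketch of the interaction loop. It is our illustration rather than code from the paper; the toy environment, its reset()/step() interface and all names are assumptions.

    import random

    class ToyEnvironment:
        """Toy line-world with states 0..4; action 1 moves right, action 0 stays."""
        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):
            # T(s, a): deterministic next-state function
            next_state = min(self.state + action, 4)
            # r(s, a): larger reward for the favourable outcome (reaching state 4)
            reward = 10 if next_state == 4 else -1
            self.state = next_state
            return next_state, reward

    env = ToyEnvironment()
    state = env.reset()
    for t in range(20):                      # discrete time steps
        action = random.choice([0, 1])       # agent picks an action from A
        state, reward = env.step(action)     # environment returns s' and r(s, a)
        # a learning agent would now update its value estimates using (s, a, r, s')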

2.1 Q learning [9]

Q-learning is a reinforcement learning technique: an intelligent algorithm that can handle problems with stochastic transitions and rewards. It works by building a policy indicating which state-action pairs return optimal results. A policy (π) is a rule that the agent follows in selecting actions, given the state it is in. Once the corresponding action-value function has been identified, the optimal policy can be constructed by simply selecting the action with the highest value in each state.
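
As a small illustration of this last point (our sketch, not the authors' code), once an action-value table is available the greedy policy can be read off with one argmax per state:

    import numpy as np

    # Hypothetical action-value table: 3 states x 2 actions
    Q = np.array([[0.0, 4.0],
                  [2.5, 1.0],
                  [0.5, 0.5]])

    # Greedy policy: in each state pick the action with the highest Q value
    policy = Q.argmax(axis=1)
    print(policy)          # -> [1 0 0]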

123
Int. j. inf. tecnol.

It has been shown [4] that for any finite MDP, Q-learning eventually finds an optimal policy, meaning that the expected value of the total reward collected over all subsequent steps from the current state is the maximum achievable. It can be applied in any of the following fields in CRAHNs: spectrum sensing, spectrum decision, spectrum mobility, and spectrum sharing. Our prime focus in this paper is on spectrum management. Experimental results show that Q-learning is able to sustain a high network load as compared to traditional techniques. During the initial period, when the Q-learning agent is still learning from examples, it might show some strain in performance, but that is almost negligible; when the load is high it actually outperforms the traditional approach. Q-learning copes not only with higher loads but also with the appearance of a primary user.

2.2 Q-learning algorithm [10, 11]

Step 1 Set the value of the discount factor γ.
Step 2 Create the reward matrix R.
Step 3 Initialize the Q matrix to zero.
Step 4 For each episode, repeat the following steps:
Step 5 Select a random start state; from that start state select a random valid action.
Step 6 Compute Q(state, action) = R(state, action) + γ * Max[Q(next state, all actions)].
Step 7 Set the next state as the current state.

The cumulative discounted reward being maximized can be written as

π = Σ_{t=0}^{∞} γ^t R_{a_t}(S_t, S_{t+1})

The algorithm involves various terms, which are defined as follows:

Discount factor (γ): the value of the discount factor determines how much importance you want to give to future rewards. Its value is always between 0 and 1. Say you go to a state which is further away from the goal state, but from that state the chance of encountering a state with a primary user is lower; here the future reward is higher even though the instant reward is lower. A value close to zero will not allow the agent to explore those choices in which the current reward is small but the long-term reward is large, while a factor approaching one will make it strive for a long-term high reward.

Reward matrix (R): this matrix contains the reward which the agent will receive after moving to each possible state. In the reward matrix, rows and columns denote states and actions respectively. We enter the values in the reward matrix according to the way the network is laid out.

Initial condition (Q matrix): Q-learning is an iterative algorithm; it implicitly assumes an initial condition before the first update. The first time an action is taken, the values from the reward matrix are used to set the corresponding values of the Q matrix; this gives immediate learning in the case of fixed deterministic rewards. The process can be repeated many times, based on the number of episodes the user wants to run. Here too, the matrix rows denote states and the columns denote actions.

So initially we start with a Q matrix in which all entries are 0. In the very first step all possible actions have equal weight, so the agent can choose any action randomly. After taking that action, the environment returns some numerical reward, which is stored for future actions. This means that initially, when the Q matrix is empty or almost empty, the agent does more exploration; once the entries are filled and learning is complete, it tries to exploit the existing solution by selecting the maximum reward available. The ultimate goal of the agent is to maximize this score: for each state we have to find the action that generates the optimal value, and the action with the best long-term reward in each state is considered the optimal action.
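
A minimal Python/numpy sketch of Steps 1-7 follows. This is our illustration rather than the authors' program; the array layout (R[s, a] for rewards, T[s, a] for the resulting state, valid_actions[s] for the valid moves), the default discount factor and the fixed step budget per episode are assumptions, and the update is exactly the simplified deterministic rule of Step 6 (no learning rate).

    import numpy as np

    def q_learning(R, T, valid_actions, gamma=0.8, episodes=8, steps_per_episode=100):
        """Fill a Q matrix using Q(s, a) = R(s, a) + gamma * max over Q(next state, ·)."""
        n_states, n_actions = R.shape
        Q = np.zeros((n_states, n_actions))            # Step 3: Q initialized to zero
        for _ in range(episodes):                      # Step 4: repeat per episode
            state = np.random.randint(n_states)        # Step 5: random start state
            for _ in range(steps_per_episode):
                action = np.random.choice(valid_actions[state])   # random valid action
                next_state = T[state, action]
                # Step 6: immediate reward plus discounted best value of the next state
                Q[state, action] = R[state, action] + gamma * Q[next_state].max()
                state = next_state                     # Step 7: next state becomes current
        return Q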

3 Experiment

In our example we have taken a 20-node network, which can be represented using a 20 × 5 matrix: we consider the 20 nodes as states (indicating which state the agent is in at a particular time) and 5 operations: up, down, left, right and no change (in which case the agent remains in the same state). To simulate the whole experiment we have used the Python language and the numpy library. Initially we have taken the Q matrix as all zero values. We have generated the position where the primary user appears, where the goal is, and which state is the start state (all these positions are generated randomly). The state space for our experiment is shown in Fig. 4.

Fig. 4 State space diagram


We have created a reward matrix in which a value of +1 is assigned to all movements (from one node to another with the intention of finding the goal) that lead to a state containing neither a primary user nor the goal; a value of -10 is assigned to all moves which result in a state occupied by a primary user; a value of +10 is assigned to all moves which result in the goal state; a value of -1 is assigned when the agent stays in the same state instead of approaching the goal, and -10 if that state contains a primary user. The objective of the agent is to maximize its score. We have initialized the Q matrix with all 0's.

Then we have created a transition matrix in which we have recorded all actions, both valid and invalid. For example, we cannot move left or up from state 1 (our start state), since these are invalid moves, while right and down are valid moves. In this fashion we store all actions for each state in a matrix: invalid moves are stored as -1, and valid moves are stored as the resulting state number; for instance, if we are at state 1 and we perform a down action, we reach state 5, and so on for every state. Now it is time to make our agent learn from the environment and then test whether it reaches the goal in time or not. We generate 5 random states: 3 states as primary users, 1 state as the start state and 1 state as the goal state. For the goal state we can use 2 different approaches:

1. Generate the goal state randomly for each iteration.
2. Keep the goal state fixed.

We can use either approach. The first approach takes extra time to learn compared to the second one, and the results are totally random in nature, that is, for every set of iterations it takes a different number of episodes to learn and gives different results. In the second approach we fix the goal state; we then repeat the experiment for 8 episodes (this number can be chosen based on when you see that there is not much change in the values of the Q matrix).
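
For illustration, the reward and transition matrices described above can be built as follows. This sketch is ours, not the paper's code; it assumes the 20 states are laid out as a 5 × 4 grid indexed 0-19 row by row (consistent with a down move from the first state reaching the fifth), with the action order [up, down, left, right, stay] and hypothetical primary-user and goal positions.

    import numpy as np

    ROWS, COLS = 5, 4                       # 20 states on an assumed 5 x 4 grid
    N_STATES, N_ACTIONS = ROWS * COLS, 5
    UP, DOWN, LEFT, RIGHT, STAY = range(N_ACTIONS)

    def build_matrices(pu_states, goal_state):
        T = np.full((N_STATES, N_ACTIONS), -1, dtype=int)   # invalid moves stored as -1
        R = np.zeros((N_STATES, N_ACTIONS))
        for s in range(N_STATES):
            row, col = divmod(s, COLS)
            targets = {UP: (row - 1, col), DOWN: (row + 1, col),
                       LEFT: (row, col - 1), RIGHT: (row, col + 1), STAY: (row, col)}
            for a, (r, c) in targets.items():
                if 0 <= r < ROWS and 0 <= c < COLS:          # valid move stays on the grid
                    ns = r * COLS + c
                    T[s, a] = ns                             # store the resulting state number
                    if ns in pu_states:
                        R[s, a] = -10                        # move lands on a primary user
                    elif ns == goal_state:
                        R[s, a] = +10                        # move reaches the goal
                    elif a == STAY:
                        R[s, a] = -1                         # staying put instead of progressing
                    else:
                        R[s, a] = +1                         # ordinary move towards the goal
        return R, T

    # Hypothetical positions: 3 primary-user states and one goal state
    R, T = build_matrices(pu_states={6, 11, 17}, goal_state=19)
    valid_actions = [np.where(T[s] != -1)[0] for s in range(N_STATES)]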

Every time we start the experiment we can choose a random start state to begin with. For the training phase we randomly generate the action at each state, and for each state we repeat the experiment several times. Based on the reward received, we assign a weight to each state-action pair. For instance, at state 16 the action 'down' has more weight than 'left' or 'up'; similarly, at state 20 'no action' has more weight than all the others. In this way we can find the best possible channel selection from the start node to the goal node, and the agent learns and remembers the moves which give a good reward in every state.
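
Tying the two sketches above together, training could look like the following. This again is our illustration; the q_learning and build_matrices helpers, the discount factor and the episode count are the assumptions introduced earlier.

    import numpy as np

    np.random.seed(0)                                  # for a repeatable illustration
    Q = q_learning(R, T, valid_actions, gamma=0.8, episodes=8)

    state = 15                                         # state 16 in the paper's 1-based numbering
    print(Q[state])                                    # with the assumed goal below it, the
                                                       # 'down' entry should carry the most weight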

In case there is a sudden appearance of a primary user in some state, the reward of that state starts to vanish (the agent goes back to the learning phase), and we can then choose the best possible next option. After executing for a specified number of iterations, our agent has learnt, and as an intermediate output we get the Q matrix with weight values for each state-action pair.

We have written code in Python based on the Markov decision process model. The Q-learning algorithm counts the number of times we hit a state in which the primary user is active. As a result we get a Q matrix which has assigned a certain weight to each state-action pair; based on these values we can predict the future path from the starting state to the goal state. The action with the maximum value for the current state is preferably chosen; in case of a tie between values of the Q matrix for a specific state, we choose among the tied actions randomly.

We can stop learning when we see that the Q matrix is getting stable, i.e., it is not changing much. We then start at a random start state, and our objective is to reach the goal state without hitting any primary user. We have repeated this experiment 1000 times to measure the performance of our agent and the validity of our Q matrix; this is done after learning for 7 episodes. The accuracy and graphs are shown in Fig. 5. The results are the following: our network takes 7 episodes to get trained; before that, we encountered a PU on the path to the goal a very large number of times, because the matrix is not yet stable and the agent may even get stuck in an infinite loop. After the 7th training episode there are on average 10 PU hits in 1000 iterations, an accuracy of 99%. If we train the network for 8 or more episodes, it never hits a cell which contains a PU (for a 20-node network). The next time a PU appears on another cell, the same process, training and then execution, is repeated.

Fig. 5 Results: PU hit chart showing the success percentage versus the number of training episodes and whether the network is trained (yes/no)
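
The evaluation step described above could be sketched as follows (our illustration, continuing the hypothetical grid setup; the greedy walk with random tie-breaking and the 1000-trial loop mirror the procedure in the text).

    import numpy as np

    def follow_policy(Q, T, start, goal, pu_states, max_steps=50):
        """Greedily follow the learned Q matrix; return True if a PU state is hit."""
        state = start
        for _ in range(max_steps):
            if state == goal:
                return False
            row = np.where(T[state] != -1, Q[state], -np.inf)   # mask invalid moves
            best = np.flatnonzero(row == row.max())
            action = np.random.choice(best)                     # break ties randomly
            state = T[state, action]
            if state in pu_states:
                return True                                     # secondary user hit a PU
        return False

    pu_states, goal = {6, 11, 17}, 19
    hits = sum(follow_policy(Q, T, np.random.randint(20), goal, pu_states)
               for _ in range(1000))
    print(f"PU hits in 1000 runs: {hits}")                      # accuracy = (1000 - hits) / 1000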


4 Conclusion and future scope

In this paper we first introduced the challenges which need to be addressed before implementing reinforcement learning in cognitive radio ad hoc networks (CRAHNs). We then suggested that using a Q-learning approach we can get optimized results. We have written a program in Python, based on the Q-learning algorithm, to calculate the number of hits to the primary user. As a result we get a Q matrix which has assigned a certain weight to each state-action pair; based on these values we can predict the future path from the starting state to the goal state. The experimental results show that after a few episodes of learning the agent never goes to a state which contains a primary user. Future work will focus on implementing a similar kind of learning in the construction of routing tables for CRAHNs.

References

1. Akyildiz IF, Lee WY, Chowdhury KR (2009) CRAHNs: cognitive radio ad hoc networks. Ad Hoc Netw 7(5):810-836. https://doi.org/10.1016/j.adhoc.2009.01.001
2. Akyildiz IF, Lee WY, Vuran MC, Mohanty S (2006) NeXt generation/dynamic spectrum access/cognitive radio wireless networks: a survey. Comput Netw 50(13):2127-2159. https://doi.org/10.1016/j.comnet.2006.05.001
3. Cesana M, Cuomo F, Ekici E (2011) Routing in cognitive radio networks: challenges and solutions. Ad Hoc Netw 9(3):228-248. https://doi.org/10.1016/j.adhoc.2010.06.009
4. Al-Rawi HAA, Yau KLA (2013) Routing in distributed cognitive radio networks: a survey. Wireless Pers Commun 69(4):1983-2020. https://doi.org/10.1007/s11277-012-0674-7
5. Sutton RS, Barto AG (1998) Reinforcement learning: an introduction. Bradford Books, Cambridge
6. Russell SJ et al (2010) Artificial intelligence: a modern approach. Prentice-Hall Inc, Upper Saddle River, NJ, USA. ISBN 0-13-103805-2
7. Akyildiz IF, Lee WY, Chowdhury KR (2009) Spectrum management in cognitive radio ad hoc networks. IEEE Netw 23(4):6-12. https://doi.org/10.1109/MNET.2009.5191140
8. Wu C, Chowdhury K, Di Felice M, Pourpeighambar B, Dehghan M, Sabaei M (2010) Spectrum management of cognitive radio using multi-agent reinforcement learning. Auton Agents Multiagent Syst 106:1705-1712. https://doi.org/10.1016/j.comcom.2017.02.013
9. Felice MD, Chowdhury K, Wu C, Bononi L (2010) Learning-based spectrum selection in cognitive radio ad hoc networks. Auton Agents Multiagent Syst. https://doi.org/10.1007/978-3-642-13315-2_11
10. Das A. https://medium.com/towards-data-science/introduction-to-q-learning-88d1c4f2b49c. Accessed 04 Oct 2017
11. http://mnemstudio.org/path-finding-q-learning.htm. Accessed 04 Oct 2017

