A Multi-Agent Reinforcement Learning-Based Collaborative Jamming System: Algorithm Design and Software-Defined Radio Implementation


Luguang Wang, Fei Song, Gui Fang, Zhibin Feng, Wen Li, Yifan Xu, Chen Pan, Xiaojing Chu

College of Communications Engineering, Army Engineering University of PLA, Nanjing 210000, China

*The corresponding author, email: songfei2021123@163.com

Abstract: In multi-agent confrontation scenarios, a single jammer is constrained by its limited performance and is inefficient in practical applications. To cope with these issues, this paper investigates the multi-agent jamming problem in a multi-user scenario, where the coordination between the jammers is considered. Firstly, a multi-agent Markov decision process (MDP) framework is used to model and analyze the multi-agent jamming problem. Secondly, a collaborative multi-agent jamming algorithm (CMJA) based on reinforcement learning is proposed. Finally, an actual intelligent jamming system is designed and built on a software-defined radio (SDR) platform for simulation and platform verification. The simulation and platform verification results show that the proposed CMJA algorithm outperforms the independent Q-learning method and provides a better jamming effect.

Keywords: multi-agent reinforcement learning; intelligent jamming; collaborative jamming; software-defined radio platform

I. Introduction

In recent years, wireless jamming technology has become a key topic in spectrum security [1]. Traditional jamming methods mainly include fixed-frequency jamming, sweeping jamming, comb jamming and so on, which usually obey preset operating modes and can be easily avoided by dynamic spectrum access [2-4]. Moreover, with the advancement of artificial intelligence and software radio technology, target systems are becoming more intelligent and anti-jamming technology is developing continuously [5-13], posing serious challenges to traditional jamming means. Thus, intelligent jamming technology combined with machine learning has been widely studied in recent years. However, the existing works mainly focus on the single-user scenario and lack verification on actual platforms. As a result, there is a need to continuously strengthen the jamming capability in multi-user scenarios to cope with communication confrontations in complex environments. In this paper, we investigate the problem of jamming channel selection in a multi-user scenario, where collaboration between jammers is considered.

However, the following problems in the study of decision-making in the jamming process have still not been fully solved. Firstly, it is difficult to attack multiple users since the jamming ability of a single jammer is limited. Secondly, group intelligence confrontation is a developing trend, and existing jamming algorithms are inefficient against opponents with ever-increasing anti-jamming ability. Finally, some researchers have proposed intelligent jamming algorithms in theory but have not verified them in actual systems. Therefore, conducting research on multi-agent jamming methods and improving the overall jamming efficiency of the jammers is an effective means of addressing the limitations of independent jamming.

To tackle the above-mentioned problems, it is urgent and necessary to study jamming models and intelligent algorithms for multi-agent jammers in a multi-user scenario and establish the long-term advantages of jammers in communication confrontation. However, to implement a collaborative multi-agent jamming algorithm, several key challenges need to be addressed: (i) The model should be reasonable. A reasonable model framework is the premise of subsequent learning and decision-making. In particular, the design of the jamming reward value is the key to optimizing the jamming decisions. (ii) The jamming policy should be effective. For different communication modes of the users, it is important to ensure the effectiveness and applicability of the jamming policy. (iii) The collaboration between jammers should be efficient. Efficient collaboration between jammers can avoid wasting jamming resources and maximize resource utilization.

In a multi-user communication scenario, we analyze the jamming problem based on the multi-agent Markov decision process (MDP) framework and design the reward value for jamming evaluation. To avoid decision conflicts between jammers, we design a collaborative jamming mechanism between the jammers and propose a jamming algorithm based on multi-agent reinforcement learning. The specific contributions of this paper are summarized as follows:

· In order to avoid conflicting decisions between jammers, a collaborative multi-agent jamming algorithm (CMJA) based on multi-agent reinforcement learning is proposed, which performs distributed calculation and collaborative decision-making.

· The simulation results verify that the proposed CMJA algorithm is effective and outperforms the independent Q-learning algorithm, and a convergence proof of the proposed algorithm is given.

· A practical collaborative jamming system based on the software-defined radio (SDR) platform is designed and built, which contains three subsystems: an intelligent jamming subsystem, a wireless transmission and communication subsystem, and a confrontation visualization subsystem. The proposed CMJA algorithm is verified in this practical system.

The rest of this paper is organized as follows. The related work is reviewed in Section II. The system model and problem formulation of the collaborative decision-making of multiple jammers are presented in Section III. In Section IV, the details of the proposed CMJA algorithm are described. Simulation results and a discussion of them are given in Section V. In Section VI, the multi-agent jamming system is introduced. Finally, the conclusion is drawn in Section VII.

II. Related Work

There is no doubt that communication confrontation has attracted a great deal of attention and has become a research hotspot in recent years. In the field of anti-jamming communication, various methods and theories, such as game theory [5-11] and machine learning [12, 13], have been proposed. Some studies [5-10] modeled and solved the interaction between users and jammers using game theory to find the best anti-jamming decisions. In [11], the authors used the Stackelberg game to develop a model and assumed that the user knows the power decision set of the jammer. Machine learning has made significant advances in decision-making in dynamic environments. In the study by Liu et al. [12], the spectrum waterfall diagram was fed into a convolutional neural network, which used a deep reinforcement learning (DRL) algorithm to achieve the anti-jamming effect. Chen et al. [13] used a DRL algorithm to optimize the power selection in anti-jamming communications and implemented their proposed algorithm using universal software radio peripheral (USRP) devices. It can be seen that many achievements have been made in the field of intelligent anti-jamming. In contrast, only a few studies on intelligent jamming have been carried out. Therefore, it is necessary to study intelligent jamming techniques to implement precision jamming.

In [14], the authors classified the traditional jamming means into four models: constant jamming, deceptive jamming, random jamming and reactive jamming. Sweep jamming and block jamming were also considered in [15]. Xu et al. [16] gave a definition of communication jamming. According to this definition, jamming types can be divided into physical layer jamming and link layer jamming [17, 18]. The premise of link layer jamming is that the link layer information (protocol, frame format, etc.) of the users is known, which is difficult to achieve in reality. Therefore, it is more straightforward and effective to study physical layer jamming methods.

Based on the above-mentioned studies, various learning-based jamming methods have been proposed [19-23]. Shi et al. [19] proposed an intelligent jamming algorithm in the frequency domain, in which the jammer can sense and learn the user's channel switching policy to achieve tracking jamming. Amuru et al. [20] proposed a power-domain intelligent jamming algorithm, where the jammer can adjust its power according to the state of the users. In [21], the authors evaluated the jamming effect of a reinforcement learning algorithm against different anti-jamming strategies. Zhang et al. [22] proposed a jamming method for virtual decision-making and validated it on the USRP platform. In [23], a deep learning-based jammer used generative adversarial networks to achieve accurate jamming with a limited number of samples. However, the above works are all based on a single jammer, which is inefficient in multi-agent confrontation scenarios.

It is important to study the communication confrontation of multi-agent cooperation. Some studies have applied multi-agent reinforcement learning to anti-jamming scenarios [24-27]. Smart users can avoid mutual interference and external malicious jamming signals through cooperation, which enhances their anti-jamming ability. At present, research on collaborative jamming is mainly aimed at friendly jamming, which protects one's own communication against an eavesdropping enemy [28-30]. The literature on multiple jammers mainly focuses on cooperative spoofing for radar detection. In [31], the authors investigated a game-theoretic power allocation problem between a radar system and multiple jammers to determine the optimal power allocation. Chang et al. [32] proposed a novel idea for modeling the jamming problem to estimate the optimal jamming amplitude. To confront the threat of a radar net, an artificial bee colony based jamming resource allocation algorithm was proposed in [33]. In [34], cooperative perception of the electromagnetic information of a radiation source target was realized by exploiting the information sharing among multiple jammers. However, within the research on multiple jammers, there is relatively little work on the selection of jamming channels to disrupt the opponent's communications. Therefore, this paper studies the collaborative jamming of multiple jammers against an opponent, which disrupts the opponent's normal communication by selecting jamming channels.

Figure 1. Schematic of the system model.

To sum up, the existing studies mainly focused on the intelligent jamming technology of a single jammer and did not consider collaborative jamming using multiple jammers. Therefore, we are inspired to study the collaborative jamming problem of multi-agent jammers. To this end, we propose a collaborative jamming algorithm based on multi-agent reinforcement learning and verify it in an actual system.

III. System Model and Problem Formulation

3.1 System Model

The system model is shown in Figure 1 and considers a scenario with M intelligent jammers and N communication users (transmitter-receiver pairs). Each smart jammer has intelligent decision-making capability and consists of a spectrum-sensing subsystem and a jamming-decision subsystem. The sets of jammers and users are denoted by M = {j1, ..., jM} and N = {u1, ..., uN}, respectively. There are K available channels, which are denoted as K = {f1, ..., fK} (K ≥ N). We consider a time-slotted system in which the length of each time slot is the same for the user and the jammer. Each jammer selects a channel in each time slot to release jamming signals, and each user can select only one transmission channel.

Figure 2. The change model of the user channel.

Figure 3. The diagram of the jamming time slot structure.

We assume that the users adopt a probabilistic frequency hopping model based on preset sequences [21] and collaborate to avoid internal interference.

For example, we assume that the nth user's frequency hopping sequence is F = {f_1^n, f_2^n, ..., f_{K-1}^n, f_K^n}, where K denotes the number of available channels. The channel of the nth user in the tth slot is f_k^n, which we denote as C_n(t) = f_k^n. The user's channel in the next slot can therefore be expressed as:

$$C_n(t+1)=\begin{cases} f_k^n, & \text{with probability } \varepsilon, \\ f_{k'}^n, & \text{with probability } 1-\varepsilon, \end{cases} \tag{1}$$

where ε ∈ (0, 1) denotes the probability that the channel remains unchanged and k' = (k + 1) mod K denotes the k'-th channel in the frequency hopping sequence F. Eq. (1) shows that the user remains on the current channel with probability ε and switches to the next channel in its frequency hopping sequence with probability 1 - ε.
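As an illustration, the hopping rule of Eq. (1) can be sketched in a few lines of Python; the function name and the parameter values below are illustrative, not part of the paper's system:

```python
import random

def next_channel(k, K, eps=0.3, rng=random):
    """Next-slot channel per Eq. (1): stay on channel k with probability eps,
    otherwise hop to k' = (k + 1) mod K along the preset sequence."""
    if rng.random() < eps:
        return k
    return (k + 1) % K

# Simulate one user's channel trace over a few slots (K = 10 channels).
random.seed(0)
trace = [0]
for _ in range(9):
    trace.append(next_channel(trace[-1], K=10, eps=0.3))
print(trace)
```

Each step either repeats the current channel or advances one position along the hopping sequence, matching the two cases of Eq. (1).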

All users collaborate to execute actions according to Eq. (1) in the same time slot. Since the users avoid internal interference, for any two different users m and n we have

$$C_m(t) \neq C_n(t), \quad \forall m \neq n. \tag{2}$$

As shown in Figure 2 (K = 6, N = 2), the horizontal axis indicates the time slot, the vertical axis indicates the available channels, the shaded areas indicate the channels used by the users, and the blank areas indicate idle channels. The users' channels change over time.

The intelligent jammer has sensing and learning abilities: it can sense the current communication frequency and learn the frequency usage pattern of the user to generate efficient intelligent jamming strategies. Figure 3 shows the time slot structure of the jammer and the user. Each jamming time slot contains a jamming sub-slot T_j, a sensing sub-slot T_wss, and a learning sub-slot T_l. T_j is used for releasing the jamming signals, T_wss is used for sensing the wideband spectrum, and T_l is used for local learning. T_u is the user's communication time. The length of a jamming time slot is T_j + T_wss + T_l and the length of the user's time slot is T_u. The jamming time slot is assumed to be equal to T_u. Different jammers interact with each other to make collaborative jamming decisions.

3.2 Problem Formulation

The working principle of a single intelligent jammer involves sensing the spectrum state and making decisions through learning, which, in turn, can affect the current state. This sequential decision process is suitable for modeling with a Markov decision process. However, in the multi-agent scenario considered in this paper, the action of any agent can affect the state. Thus, we use an extension of the MDP for modeling multi-agent scenarios.

A multi-agent MDP can be represented as the tuple:

$$\langle \mathcal{M}, S, A_1, \ldots, A_M, Pr, r_1, \ldots, r_M \rangle, \tag{3}$$

where the specific meaning of each element is as follows:

· M = {j1, ..., jM} denotes the set of intelligent jammers.

· S denotes the environmental state space; st ∈ S is an element of the state space and indicates the environment state of the jammers.

· Am, m = 1, ..., M, denotes the action space of the intelligent jammer jm; am ∈ Am, with Am = {f1, ..., fK}, denotes the strategy chosen by the jammer jm.

· Pr: S × A1 × ... × AM → [0, 1] denotes the state transition probability function, which represents the probability of the state moving to s′ after each jammer executes its action am ∈ Am in the state s.

· rm: S × A1 × ... × AM → R denotes the immediate reward obtained after the jammer jm executes an action am ∈ Am in the state s.

The state of the environment is defined as follows:

$$s_t = \{u_1(t), u_2(t), \ldots, u_N(t)\}, \tag{4}$$

where u_n(t) ∈ {f1, ..., fK} denotes the channel on which the nth user is communicating in the tth time slot.

Each jammer selects its jamming channel in the state st, and the independent action space of each jammer is the same: A1 = A2 = ··· = AM. The independent action space of any jammer can be expressed as:

$$A_m = \{f_1, f_2, \ldots, f_K\}, \tag{5}$$

where the action am ∈ {f1, ..., fK} denotes the jamming channel of the jammer jm. The collaborative action a = {a1, ..., aM} denotes the combination of the jamming actions of the jammers. Thus, the collaborative jamming action space can be expressed as follows:

$$\mathbf{A} = A_1 \otimes A_2 \otimes \cdots \otimes A_M, \tag{6}$$

where ⊗ represents the Cartesian product operation.
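For concreteness, the collaborative action space can be enumerated as a Cartesian product; the values of K and M below are illustrative:

```python
from itertools import product

K, M = 3, 2  # illustrative: 3 channels, 2 jammers
channels = [f"f{k}" for k in range(1, K + 1)]

# Collaborative action space A = A_1 x ... x A_M (Cartesian product of
# the identical per-jammer channel sets).
joint_actions = list(product(channels, repeat=M))
print(len(joint_actions))  # K**M joint actions
```

The size of the joint space, K^M, is what drives the exponential complexity discussed in Section 4.3.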

The transition of the state depends on the changes in the users' channels, which are hard to predict because the behavior of the users is unknown to the jammers.

In this paper, we quantify the effect of jamming suppression as the reward value. When the jammer jm takes an action am that successfully blocks any user channel, the independent reward value of jm is 1; otherwise, it is 0. Considering the collaboration between the jammers, when another jammer jn takes the same action, i.e., n ≠ m but an = am, the reward value is reduced. The joint reward value of the jammer jm in the tth slot is defined as:

$$r_m(t) = \sum_{n=1}^{N} \delta(a_m, u_n(t)) - \sum_{i=1, i \neq m}^{M} \delta(a_m, a_i), \tag{7}$$

and δ(p, q) is expressed as:

$$\delta(p, q) = \begin{cases} 1, & p = q, \\ 0, & p \neq q. \end{cases} \tag{8}$$

When all the jammers take a joint action a = {a1, ..., aM}, the immediate reward value of each jammer and the sum of the overall reward values can be obtained. The total reward value of the jammers taking the joint action a = {a1, ..., aM} in the state st is expressed as:

$$R(s_t, \mathbf{a}) = \sum_{m=1}^{M} r_m(s_t, \mathbf{a}). \tag{9}$$
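A minimal sketch of this reward computation, assuming the hit-minus-duplication form described above (the function names are illustrative):

```python
def delta(p, q):
    """Kronecker delta of Eq. (8)."""
    return 1 if p == q else 0

def joint_reward(m, actions, user_channels):
    """Joint reward of jammer m: +1 for hitting an occupied user channel,
    -1 for each other jammer that picked the same channel."""
    a_m = actions[m]
    hit = sum(delta(a_m, u) for u in user_channels)  # users occupy distinct channels
    dup = sum(delta(a_m, a) for i, a in enumerate(actions) if i != m)
    return hit - dup

users = ["f1", "f4"]
print([joint_reward(m, ["f1", "f4"], users) for m in range(2)])  # both hit: [1, 1]
print([joint_reward(m, ["f1", "f1"], users) for m in range(2)])  # duplicated: [0, 0]
```

The second case shows why coordination matters: two jammers piling onto the same busy channel earn no more than missing entirely, which steers the learned policy toward covering distinct user channels.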

We define the decision policy of the jammer jm as πm, and the joint policy of all jammers as π = {π1, ..., πM}. The common goal of the multi-agent system is to obtain the optimal joint policy π* = {π1*, ..., πM*}. Each jammer can obtain the maximum long-term cumulative discounted reward by executing the optimal policy π*.

Therefore, each jammer aims at maximizing the cumulative expected reward, which can be expressed as:

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} r_m(s_{t+\tau}, \mathbf{a}_{t+\tau})\right], \tag{10}$$

where s_{t+τ} and a_{t+τ} denote the state and joint action, respectively, in the (t+τ)th time slot. E_π[·] is the mathematical expectation under the joint policy π, r_m is the immediate reward value that the jammer jm obtains after executing the action πm ∈ π, and 0 ≤ γ < 1 denotes the discount factor for long-term rewards.

Figure 4. Schematic of the collaborative multi-agent jamming framework.

IV. The Proposed CMJA Algorithm

4.1 Algorithm Description

A multi-agent MDP is suitable for solution with a reinforcement learning algorithm. Q-learning is a classical model-free reinforcement learning algorithm that works on a "decision-feedback-update" mechanism [35]. The Q-learning method stores all the Q-values corresponding to the "state-action" pairs in a Q-value table. The agent makes a decision based on the current state and updates the Q-value table with the obtained reward values.

Motivated by [36], we propose a collaborative jamming algorithm based on multi-agent reinforcement learning for the multi-agent MDP model. As illustrated in Figure 4, each jammer maintains an independent Q-value table and the central server maintains a collaborative Q-value table. Each jammer updates its Q-value table based on the state it senses and the reward it obtains, whereas the central server receives all the independent Q-value information to update the collaborative Q-value table and make collaborative decisions. Therefore, the process of updating the Q-value table realizes the function of "distributed calculation and collaborative decision". The jammer jm updates its Q-value table according to the following equation:

$$Q_m(s_t, \mathbf{a}_t) \leftarrow Q_m(s_t, \mathbf{a}_t) + \alpha \left[ r_m(s_t, \mathbf{a}_t) + \gamma Q_m(s_{t+1}, \mathbf{a}^*) - Q_m(s_t, \mathbf{a}_t) \right], \tag{11}$$

where α denotes the learning rate and γ denotes the discount factor. s_{t+1} denotes the next state after the execution of the collaborative action a_t in the state s_t, and r_m(s_t, a_t) denotes the immediate reward that the jammer jm obtains after all the jammers take the collaborative action a_t in the state s_t. a* denotes the collaborative action in the state s_{t+1} that yields the maximum gain for all the jammers, which is given by the following equation:

$$\mathbf{a}^* = \arg\max_{\mathbf{a}' \in \mathbf{A}} \sum_{m=1}^{M} Q_m(s_{t+1}, \mathbf{a}'). \tag{12}$$

In this case, the multi-agent Q-learning update in Eq. (11) is computed in a distributed manner. Each jammer updates its Q-value individually while jointly maintaining the same collaborative Q-value table. For Eq. (12), a global coordination policy with common rewards needs to be solved:

$$Q(s_t, \mathbf{a}) = \sum_{m=1}^{M} Q_m(s_t, \mathbf{a}), \tag{13}$$

where Q_m(s_t, a) denotes the Q-value of the jammer jm, which can be called the independent Q-value table, and Q(s_t, a) denotes the collaborative Q-value table that is maintained and updated by all jammers. Therefore, updating Q(s_t, a) can be transformed into updating the Q_m(s_t, a) of each jammer. According to Eq. (12), the jammers obtain the optimal collaborative policy when all the Q-values Q_m(s_t, a) in the collaborative Q-table converge to their optimal values.
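The "distributed calculation, collaborative decision" update of Eqs. (11)-(13) can be sketched as follows for a toy problem; the table layout, state encoding and parameter values are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict
from itertools import product

K, M = 3, 2
joint_actions = list(product(range(K), repeat=M))

# One independent Q-table per jammer, indexed by (state, joint_action).
Qm = [defaultdict(float) for _ in range(M)]

def best_joint_action(state):
    """Eq. (12): a* maximizes the collaborative value sum_m Q_m(s, a)."""
    return max(joint_actions, key=lambda a: sum(Q[(state, a)] for Q in Qm))

def update(state, action, rewards, next_state, alpha=0.1, gamma=0.6):
    """Eq. (11): each jammer moves its own Q-value toward r_m + gamma*Q_m(s', a*)."""
    a_star = best_joint_action(next_state)
    for m in range(M):
        q = Qm[m][(state, action)]
        Qm[m][(state, action)] = q + alpha * (rewards[m] + gamma * Qm[m][(next_state, a_star)] - q)

update(state=(0, 1), action=(0, 1), rewards=[1, 1], next_state=(1, 2))
print(Qm[0][((0, 1), (0, 1))])  # 0.1 after one update from zero
```

Each jammer's table is updated locally (the loop body), while the argmax over the summed tables plays the role of the central server's collaborative Q-table in Eq. (13).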

To prevent the reinforcement learning algorithm from falling into the "exploration-exploitation" dilemma, we use the ε-greedy strategy to balance exploration and exploitation. The jammers randomly select a joint action a ∈ A with a probability of ε, and select the joint action a* given by Eq. (12) with a probability of 1 - ε. To achieve a smooth transition of the decision-making actions from exploration to exploitation, we design ε as follows:

$$\varepsilon = \varepsilon_0 e^{-\lambda t}, \tag{14}$$

where ε0 denotes the initial exploration rate, λ denotes the decay rate parameter and t denotes the iteration time. As the number of iterations increases, the value of ε gradually approaches 0 and the jammers tend to select the joint action a*.
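One plausible realization of such a decay schedule, assuming an exponential form (the values of ε0 and λ are illustrative):

```python
import math

def epsilon(t, eps0=0.9, lam=0.01):
    """Exploration probability: starts at eps0 and decays smoothly toward 0 with t."""
    return eps0 * math.exp(-lam * t)

print(epsilon(0), epsilon(500))  # decays from 0.9 toward 0
```

Early slots are dominated by random joint actions (exploration); as ε shrinks, the jammers almost always pick the greedy joint action a*.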

Based on the above analysis, the collaborative multi-agent jamming algorithm (CMJA) is proposed, and its steps are given in Algorithm 1.

4.2 Convergence Analysis

As discussed above, the optimal policy in each state is given by the collaborative Q-table expressed by Eq. (13). The Q_m(s_t, a) of each jammer is calculated independently and the collaborative Q(s_t, a) is their sum. The convergence of the collaborative Q-learning algorithm can be guaranteed by the convergence condition of single-agent Q-learning. Referring to [36-38], the convergence condition of collaborative Q-learning is given in Theorem 1.

Algorithm 1. Collaborative multi-agent jamming algorithm (CMJA).
Initialization: S, Q(s_t, a), Q_m(s_t, a);
1: for t = 0, ..., T do
2:   Each jammer observes its current state s_t = {u_1(t), ..., u_N(t)} and selects a channel according to the following rules:
     · The jammer j_m randomly chooses a channel profile a ∈ A with a probability of ε.
     · The jammer j_m chooses a channel profile a* ∈ arg max_{a'} Σ_{m=1}^{M} Q_m(s', a') with a probability of 1 - ε.
3:   Each jammer calculates its reward r_m(s_t, a_m).
4:   The state is transformed into s_{t+1} = {u_1(t+1), ..., u_N(t+1)}. Eq. (11) and Eq. (13) are used for updating the values of Q_m and Q, respectively, in the state s_t.
5: end for

Theorem 1. Given the bounded rewards r_m and the learning rate α ∈ (0, 1), if α satisfies:

$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty, \tag{15}$$

the agents will converge to the optimal policy as ε → 0.

Therefore, the convergence of the proposed algorithm can be guaranteed as long as the learning rate α is set to meet the above conditions.

During the iteration process,the Q-value table is constantly updated until it can no longer be updated or the changes in it are very small,which indicates that the Q-value has converged to the optimum value and the policy made on the basis of the collaborative Q-table is the optimal policy at that moment.

4.3 Algorithm Complexity Analysis

Motivated by [39], we analyze the complexity of the proposed algorithm. The algorithm complexity can be expressed as O(F), which is determined by the code that is executed the largest number of times in the algorithm. The proposed algorithm consists of three main stages: making a collaborative decision, calculating the reward value, and updating the Q-value table. We assume that the number of jammers is M, the number of user pairs is N, and the number of channels is K. The complexities of the three stages in one time slot are analyzed as follows.

The jammers take a collaborative decision action according to Eq. (12), and the complexity can be expressed as C_d = O(A_K^N × K^M), where A_K^N and K^M denote the number of rows (states) and columns (joint actions) of the collaborative Q-value table, respectively.

The complexity of the reward value calculation is based on the reward designed in Eq. (7) and can be expressed as C_r = M × O(N + M), where N indicates the number of checks the jammer jm requires to determine whether a communication channel matching its action am currently exists, and M denotes the number of operations required to account for the actions of the remaining jammers.

The Q-value table updating process requires updating the Q-value tables of the M jammers independently, and the complexity can be expressed as C_u = M × O(A_K^N × K^M), where A_K^N and K^M denote the number of rows and columns of an independent Q-value table, respectively.

We assume that the number of iteration time slots is T_num. Therefore, the complexity of the proposed CMJA algorithm can be expressed by the following equation:

$$C = T_{num} \times (C_d + C_r + C_u) = O\left(T_{num} \times M \times A_K^N \times K^M\right). \tag{16}$$

It can be concluded that the complexity of the proposed algorithm increases exponentially with the number of jammers and channels, thus making it suitable for small-scale scenarios.
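For the simulation setup used later (M = N = 2, K = 10), the table dimensions implied by this analysis can be checked numerically, assuming the state space consists of ordered N-tuples of distinct channels:

```python
from math import perm

M, N, K = 2, 2, 10
rows = perm(K, N)   # states: ordered choices of N distinct channels from K
cols = K ** M       # joint actions
print(rows, cols, rows * cols)  # 90 100 9000
```

A 90 x 100 table is easily manageable, but adding one more jammer multiplies the columns by K, which illustrates the exponential growth noted above.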

V. Simulation Results and Discussion

This section presents the simulation results. Consider a scenario with two jammers and two user pairs, i.e., M = N = 2. The users have 10 available channels, i.e., K = 10. Table 1 gives the main parameters of the CMJA algorithm. The initial values of the parameters are chosen based on empirical values, and further tuning of the parameters is performed during the simulation. In particular, a compromise value for the discount factor γ is chosen to balance present and future rewards, whereas a smaller value for the learning rate α is chosen to ensure a balanced weighting of the reward values. The other main parameters are shown in Table 1.

Table 1. Parameter values used in the simulation.

To verify the effectiveness of the proposed CMJA algorithm, we compare it with the independent Q-learning algorithm [22], in which each jammer executes the classical independent Q-learning method without considering coordination among the jammers. In the simulation, we verify the performance of the proposed CMJA algorithm using two indexes: the jamming success rate and the normalized throughput of the users. In addition, we assume that the jammers can detect and count the users' ACK messages. Therefore, we define the jamming success rate as:

$$P_{suc} = \frac{S_{suc,j}}{S_{tot,j}}, \tag{17}$$

where S_suc,j denotes the number of packets successfully jammed and S_tot,j denotes the total packet count. The normalized throughput of the users is defined as:

$$T_{nor} = \frac{S_{cur,t}}{S_{no,j}}, \tag{18}$$

where S_cur,t denotes the number of packets currently being successfully transmitted and S_no,j denotes the number of packets transmitted without jamming. To make the simulation results clear and intuitive, the jamming success rate and normalized throughput of every update time slot are averaged over every 20 jamming time slots. The results are obtained by averaging over 50 independent runs.
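These two indexes reduce to simple packet-count ratios; a minimal sketch (the packet counts below are illustrative, not simulation results):

```python
def jamming_success_rate(s_suc, s_tot):
    """Eq. (17): fraction of transmitted packets that were successfully jammed."""
    return s_suc / s_tot

def normalized_throughput(s_cur, s_no_jam):
    """Eq. (18): packets delivered under jamming, relative to the jam-free count."""
    return s_cur / s_no_jam

print(jamming_success_rate(14, 20), normalized_throughput(6, 20))  # 0.7 0.3
```

Note that when every jammed packet is lost and the offered load is constant, the two indexes are complementary, which is why the throughput curves mirror the success-rate curves in the figures.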

Figure 5. Jamming success probability in mode I.

In the simulations, as a comparison, we assume that the users have the following two channel switching modes:

Mode I: The users communicate with each other using a fixed-sequence frequency hopping mode.

Mode II: The users use probabilistic frequency hopping to communicate according to Eq. (1). The current channel is retained with a probability of 30% and the next channel is selected with a probability of 70%.

5.1 The Simulation Results for Mode I

Figure 5 shows the jamming success probability curves of the jammers. The simulation results show that at the beginning of the algorithm, the success rates of the proposed CMJA algorithm and the independent Q-learning algorithm are low and approximately the same. As time goes by and the jammers continue to learn until the policy table converges, the jamming success rate of the CMJA algorithm reaches 100%, whereas the independent Q-learning algorithm achieves only about 50%.

Figure 6. The change in the users' throughput in mode I.

Figure 7. Jamming success probability in mode II.

Figure 6 shows a comparison of the normalized throughput under the CMJA algorithm and the independent Q-learning algorithm. The throughput under the independent Q-learning jamming algorithm gradually decreases over time, eventually stabilizing around 30%. This is because there is no cooperative association among the jammers, each of which selects channels independently. The two jammers can take the same action in one time slot, in which case some users can communicate normally. The CMJA algorithm considers the coordination between the jammers and makes optimal decisions that can successfully jam the two user channels simultaneously. Thus, the jammers gradually find the optimal jamming policy, and the throughput gradually decreases and eventually converges, fluctuating around 5%.

5.2 The Simulation Results for Mode II

Figure 8. The change in the users' throughput in mode II.

The jamming success probability curve for the users communicating in mode II is shown in Figure 7. In this case, the users probabilistically select the communication channel instead of using a fixed channel switching policy. It can be seen that the CMJA algorithm can learn the frequency usage policy of the users and jam the communication channels with a certain probability. In contrast, when the jammers adopt the independent Q-learning algorithm, the success rate of jamming is low due to the uncertainty of the users' channel switching and the independence of the jammers. With a user channel transition probability of 70%, the jammers executing the proposed CMJA algorithm can successfully jam the user data with a probability of 70%.

Figure 8 shows the variation of the normalized throughput under the CMJA algorithm and the independent Q-learning algorithm. The normalized throughput is maintained at a high level when the jammers adopt the independent Q-learning algorithm because of its low jamming success rate. Statistically, around 60% of the data is transmitted properly and 40% of the user data is successfully jammed. When the jammers execute the proposed CMJA algorithm, the users' normalized throughput fluctuates around 35% during the convergence phase and around 65% of the user data is successfully jammed. Compared to the independent Q-learning algorithm, the normalized throughput of the users drops by approximately 25%.

The large fluctuations of the curves in Figure 7 and Figure 8 arise because the channel switching of the users is uncertain. When statistics are collected every 20 time slots, there is uncertainty in the number of times the users choose to stay on their current channels. In addition, when the users choose to stay on the current channel in the next time slot, the jammers tend to select the next channel with a larger Q-value, which causes a decision error at that point; thus, the curves exhibit some fluctuation.

Based on the analysis described above, we can see that the proposed CMJA algorithm exhibits superior performance compared to the independent Q-learning algorithm. The latter does not consider the coordination between jammers, and each jammer selects its channel independently. Thus, different jammers can make the same decision, which results in a waste of spectrum resources. In contrast, the proposed CMJA algorithm not only learns the actions of the users but also considers the coordination between the jammers. Thus, the jamming effect is better than that achieved by the independent Q-learning method.

VI. Multi-Agent Jamming System

This section describes a multi-agent jamming system built on a software radio platform, which uses a Linux system and C++ development software for the overall system design. The system uses NI USRP 2920 and B210 devices as the hardware platform. The composition of the system is shown in Figure 9. In terms of functional composition, the system contains three subsystems: an intelligent jamming subsystem, a wireless transmission and communication subsystem, and a confrontation visualization subsystem. The intelligent jamming subsystem contains two submodules, namely, a spectrum-sensing submodule and an intelligent decision-making submodule, and the two submodules coordinate to drive the implementation of the proposed CMJA algorithm.

The wireless communication subsystem serves as the companion system for verifying the algorithm, completing the transmission and reception of user data. This subsystem consists of four USRP B210 devices and a PC terminal, which are connected via a switch and a gigabit network port. The communication frequency parameters and the channel switching policy can be set at the PC terminal.

The confrontation visualization subsystem consists of a PC terminal, which provides interface operation and display through the developed user terminal program. The subsystem can display the received spectrum waveform and analyze, in real-time, the number of data and ACK packets normally transmitted by the communication system.

Figure 9. The composition of the system.

For the platform system presented in this section, our main contributions are the spectrum-sensing process and the intelligent decision-making process.

6.1 Design of the Intelligent Jamming System

The spectrum-sensing subsystem contains a USRP 2920 device and a PC terminal, which are connected via gigabit Ethernet. The USRP device is driven by the USRP universal hardware driver (UHD) to receive the users' signals, and the digital signals are processed in the PC terminal. The spectrum-sensing subsystem can obtain the spectrum state in real-time by using a wideband fast spectrum sensing technology [40].

The intelligent decision-making subsystem consists of two USRP 2920 devices and a PC terminal, and it selects the jamming channels based on the proposed CMJA algorithm. The subsystem uses the user frequency data sent by the spectrum-sensing module as the current state s_t = (u_1(t), u_2(t)) and determines the collaborative jamming channels a = (a_1, a_2) by searching the collaborative Q-table. The subsystem works by online learning, during which the Q-table of each jammer and the collaborative Q-table are updated.

The steps for jammers to release the jamming signals are as follows:

Step 1: The independent Q-tables, the collaborative Q-table, and the reward matrix are initialized.

Step 2:Interaction with the spectrum-sensing system is carried out for obtaining the current communication frequency.

Step 3:Based on the current users’channels,i.e.,the current state,the jammers decide on the communication channel to be jammed in the next time slot on the basis of the collaborative Q-value table.

Step 4:The relevant jamming parameters are configured based on the sensed user information,such as the jamming duration,the jamming frequency interval,the transmission power of the jamming signals,and the jamming gain.

Step 5: Through the UHD driver configuration, the USRP devices are driven to send the jamming signals to achieve accurate jamming.

Step 6: Steps 2 to 5 are repeated to realize intelligent decision-making.
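The steps above can be sketched as a control loop in which the hardware-dependent pieces (UHD-driven sensing and transmission) are injected as callables, so the flow can be exercised without USRP devices. All function and parameter names here are hypothetical.

```python
def jamming_loop(sense, decide, update, configure, transmit, n_slots):
    """Skeleton of Steps 2-6 (Step 1, table initialization, is assumed
    done by the caller). Each argument is a callable standing in for one
    subsystem: sensing, Q-table decision, online learning, parameter
    configuration, and UHD transmission."""
    history = []
    for _ in range(n_slots):
        state = sense()                    # Step 2: current user channels (u1, u2)
        action = decide(state)             # Step 3: joint channels from the collaborative Q-table
        cfg = configure(state, action)     # Step 4: duration, frequency, power, gain
        reward = transmit(cfg)             # Step 5: drive the USRPs, observe hit/miss
        update(state, action, reward)      # online update of the Q-tables
        history.append((state, action, reward))
    return history                         # Step 6: loop over time slots
```

In the actual platform the `transmit` step goes through the UHD driver; here it is abstracted so the control flow itself can be tested.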

6.2 Testing and Verification of the Multi-Agent Jamming System

Based on the introduction of the multi-agent jamming system described above,the demonstration verification system built in this work is shown in Figure 10.

This section presents the procedure and the results of the test conducted to verify the effectiveness of the proposed algorithm. The actual jamming effect is tested by using the communication system built as a companion. To accurately evaluate the jamming performance of the algorithm, we define the normalized throughput of the users as the evaluation index of the jamming effect:

Figure 10.The demonstration verification system.

Figure 11.The initial spectrum.

The normalized throughput is computed as Pack/Pall, where Pack denotes the number of packets correctly received at a certain time during the actual transmission and Pall denotes the total number of packets transmitted during the actual transmission.
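As a minimal illustration, the evaluation index reduces to a single ratio:

```python
def normalized_throughput(p_ack, p_all):
    """Evaluation index from the paper: the fraction of transmitted
    packets that were correctly received. Lower values indicate a
    stronger jamming effect."""
    if p_all == 0:
        raise ValueError("no packets transmitted")
    return p_ack / p_all
```

For example, 8 correctly received packets out of 100 transmitted gives 0.08, matching the roughly 8% level reached in scenario I.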

In the test, two scenarios are considered for verifying the intelligence and effectiveness of the built multi-agent jamming system by combining the two communication modes mentioned above. The actual communication frequency range of the users is 834-852 MHz, with a frequency interval of 2 MHz, for a total of 10 communication channels. The length of each communication time slot is 1 s.
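For reference, the channel centers used in the test can be enumerated directly, assuming (as the stated parameters imply) centers at 834, 836, ..., 852 MHz:

```python
def channel_plan(f_start_mhz=834.0, spacing_mhz=2.0, n_ch=10):
    """Center frequencies (MHz) of the 10 user channels in the platform
    test: 834-852 MHz with 2 MHz spacing."""
    return [f_start_mhz + k * spacing_mhz for k in range(n_ch)]
```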

The display of the confrontation visualization subsystem is shown in Figure 11, which presents the spectrum waveform when the jamming signal is first released. The orange and yellow boxes mark the spectrum waveforms of the communication signal and the jamming signal, respectively, whereas the pink box below shows the users' normalized throughput, which is used for evaluating the jamming effect. The transmission success rate reaches 100% before the initial jamming signal is sent and drops once jamming begins. The details of the system test under the two scenarios are given below.

Scenario I:Users use fixed sequence frequency hopping(mode I)to communicate.

To make the test results intuitive and clear,we calculate the normalized throughput after every 5 communication time slots and the results thus obtained are shown in Figure 12(a).To quantitatively analyze the jamming effect of the proposed algorithm,we record and plot the normalized throughput in Figure 12(b).

Figure 12.The test results in scenario I.

Figure 13.The test results in scenario II.

At the beginning of the algorithm execution, the jammers have not yet learned the users' communication pattern and release the jamming signals blindly and irregularly. Consequently, the jamming effect is not obvious and the normalized throughput remains at a high level. The normalized throughput of the users gradually decreases as the jammers continue learning and iterating, and finally stabilizes at a low level as the algorithm converges. From Figure 12(b), it can be seen that the users' normalized throughput is reduced to approximately 8%; at this point, the jamming spectrum precisely suppresses the communication spectrum. The throughput curve in the figure illustrates the entire online learning process of the jamming algorithm. The actual platform test results are consistent with the simulation results, which proves the effectiveness of the proposed CMJA algorithm.

Scenario II: Users use probabilistic frequency hopping (mode II) to communicate. In each time slot, the users remain in the current channel with a probability of 30% and move to the next channel with a probability of 70%.
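The mode-II channel behavior can be sketched as a simple two-outcome process; the wrap-around at the last channel is an assumption, since the paper does not state the edge behavior:

```python
import random

def mode2_user(start_ch, n_ch=10, p_stay=0.3, rng=random):
    """Generator for the mode-II user: stay on the current channel with
    probability p_stay (30%), otherwise move to the next channel. The
    wrap-around at channel n_ch-1 is an assumption."""
    ch = start_ch
    while True:
        yield ch
        ch = ch if rng.random() < p_stay else (ch + 1) % n_ch
```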

We calculate the normalized throughput after every 10 communication time slots.The results thus obtained are shown in Figure 13(a)and the recorded data is plotted in Figure 13(b).

The proposed CMJA algorithm cannot suppress the user channels with 100% accuracy when the communication users adopt the probabilistic frequency-hopping mode. This is due to the random nature of the users' channel changes. Based on the current state, the jammers select the action with the largest Q-value in the collaborative Q-table during the convergence phase of the algorithm. When the users choose to remain in the current channel, the jammers nevertheless select the next channel, whose Q-value is larger, which results in a decision error.

As shown in Figure 13(b), the jammers can successfully jam the users' packets with a probability of 55%. This differs from the simulation results because the assumptions in the simulation are ideal, whereas multipath effects and transmission delays exist in an actual wireless communication environment. As a result, the jamming success rate in the actual communication test is lower than in the simulation.
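The measured 55% can be put in context with a quick estimate: a jammer that always predicts "next channel" hits exactly when the user hops, so under ideal conditions its success rate is bounded by 1 - p_stay = 70%. The Monte-Carlo sketch below confirms this ceiling; the gap down to the measured 55% is then attributable to the multipath and delay effects noted above.

```python
import random

def greedy_hit_rate(p_stay=0.3, n_slots=100_000, seed=1):
    """Monte-Carlo estimate of the ideal success rate of a deterministic
    jammer against the mode-II user: predicting 'next channel' every
    slot hits exactly when the user hops, i.e. with probability 1 - p_stay."""
    rng = random.Random(seed)
    hits = sum(rng.random() >= p_stay for _ in range(n_slots))
    return hits / n_slots
```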

6.3 Applications and Perspectives

Traditional communication jamming techniques and equipment have significant capability shortcomings when countering networked systems. A distributed cluster system has strong jamming capability in the time, frequency, and spatial domains, and cognitive electronic countermeasures can adopt the collaborative jamming technology of multiple jammers, which can be optimized in the time, frequency, and spatial domains simultaneously [41, 42]. The distributed cluster system can integrate multiple distributed jamming subsystems to share spectrum resources and complete collaborative jamming of the target network system. Therefore, the collaborative jamming algorithm proposed in this paper has good application prospects in distributed cluster systems and offers an effective means of jamming networked communication systems.

In addition, software-defined radio is highly flexible, allowing new functions to be added through software modules, and highly open, as its hardware can be updated or extended as devices and technology develop. SDR technology is currently developing in the direction of miniaturization, integration, and intelligence, and has good prospects in the field of communication countermeasures. For example, the SDR architecture can be deployed in UAV electronic jamming systems, which require multiple jammers to work together in a coordinated manner to complete the jamming task. Therefore, the proposed multi-agent jamming scheme can be applied to UAV group collaborative jamming in the future.

In this paper, we investigated the problem of channel selection for multiple intelligent jammers in a multi-user scenario. Firstly, we introduced the multi-agent MDP framework to model and analyze the multi-agent jamming problem. Secondly, a collaborative jamming algorithm based on multi-agent reinforcement learning was proposed, and simulation results showed that the CMJA algorithm outperforms the independent Q-learning algorithm. Finally, the effectiveness of the proposed CMJA algorithm was verified on the SDR platform. The verification results showed that the proposed CMJA algorithm can effectively jam multi-user communications through "distributed calculation and collaborative decision".

It should be pointed out that the proposed CMJA algorithm is based on the Q-learning algorithm, a table-search-based reinforcement learning method that cannot solve high-dimensional decision problems in large-scale scenarios. Recently, the mean-field learning method has been widely studied and may be a feasible way to address this shortcoming of Q-learning. In future work, we will consider modeling the multi-agent decision-making process as a Markov game and solving it with a mean-field multi-agent reinforcement learning algorithm, aiming to realize fast decision-making in large-scale communication confrontation scenarios.

ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foundation of China (No. 62071488 and No. 62061013).
