Self-Learning Controllers in the Oil and Gas Industry

Recently, the use of artificial intelligence to solve optimization-control problems has appeared widely in petroleum exploration and production. This paper presents a state-of-the-art reinforcement-learning algorithm applied to petroleum optimization-control problems, called direct heuristic dynamic programming (DHDP). DHDP has two interacting artificial neural networks: the critic network (which provides a critique/evaluation signal) and the actor network (which provides a control signal). This paper focuses on a generic online learning control system based on Markov decision process principles. Furthermore, DHDP is a model-free learning design that does not require prior knowledge of a dynamic model; therefore, DHDP can be applied directly to any petroleum equipment or device without needing to derive a mathematical model. Moreover, DHDP learns by itself (self-learning) without human intervention, through repeated interaction between the equipment and its environment/process. The equipment receives the states of the environment/process via sensors, and the algorithm maximizes the reward by selecting the correct optimal action (control signal). A quadruple-tank system (QTS), whose nonlinear model responds closely to the real plant, is taken as a benchmark test problem for three reasons. First, the QTS (the entire system or its parts) is widely used in most petroleum exploration/production fields; it consists of four tanks and two electrical pumps with two pressure-control valves. Second, the QTS is a difficult model to control, with only a limited zone of operating parameters in which it is stable; therefore, if DHDP can control the QTS by itself, it can control other equipment in a fast and optimal manner. Third, the QTS is designed as a multi-input multi-output (MIMO) model for analysis as a real-time nonlinear dynamic system; therefore, the QTS model is similar to most MIMO devices in the oil and gas field. The overall learning control system performance is tested and compared with a proportional-integral-derivative (PID) controller via MATLAB programming. DHDP provides enhanced performance compared with the PID approach, with a 99.2466% improvement.



I. Introduction
Approximate dynamic programming (ADP) is a useful tool for handling the behavior of nonlinear systems [1]. ADP has three categories [2]: heuristic dynamic programming (HDP), dual heuristic programming (DHP), and globalized DHP (GDHP). ADP has two neural networks, the actor and the critic, which provide the optimal control signal and the long-run cost value, respectively. When the action-dependent (AD) form is used, each category has an AD variant (ADHDP for HDP and ADDHP for DHP). ADP is used in many real applications. For instance, [3] presents how to control a turbo-generator; [4] shows the ability of DHP to solve swarm-robot problems; and [5] and [6] illustrate that ADHDP can obtain an optimal path for multi-robot navigation.
Recently, [7] and [8] applied reinforcement learning to Atari games to solve many hard problems with huge numbers of states. All of the ADP approaches above use temporal-difference learning based on the Markov decision process. A Markov decision process contains a set of model states, a set of actions, a reward or cost function, and a system model. The core of a Markov decision process is to find a sequence of actions for a given state that makes the cost low or the long-run reward high (a minimal sketch of this idea follows below). The main aim of this paper is to use the HDP approach to control the process of a quadruple-tank system (QTS), which is frequently used in the oil and gas industry. The QTS consists of four interconnected tanks and two motor pumps [9]. HDP is used to control the voltages of the two pumps so that the tank levels follow the desired set-point levels; this is the first such approach to appear in the literature. This paper presents a self-learning algorithm that builds a controller from scratch, without human intervention, to control the tank levels of the QTS.
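As a minimal illustration of this idea (not the paper's algorithm), the following value-iteration sketch finds, for each state of a toy finite MDP, the action that maximizes the discounted long-run reward; the transition and reward tables are random placeholders, not the QTS model.

```python
import numpy as np

# Toy MDP: n_states states, n_actions actions, discount factor gamma.
n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.random((n_states, n_actions))                             # r(s, a)

J = np.zeros(n_states)
for _ in range(500):                      # sweep until J converges
    Q = R + gamma * P @ J                 # Q[s, a] = r + gamma * E[J(s')]
    J_new = Q.max(axis=1)
    if np.max(np.abs(J_new - J)) < 1e-8:
        break
    J = J_new
policy = Q.argmax(axis=1)                 # greedy action for each state
```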

II. Devices and experiments
This section presents the main aspects of HDP, as in [2] and [6], together with the details of learning the nonlinear QTS model, as in [9].

A. Architecture of the HDP Approach
The main block diagram of the featured DHDP is illustrated in Figure (1).

Fig. (1) Block diagram for HDP. $u_t$ is the action vector at time $t$ that controls the motor pumps of the QTS; it comes from the actor neural network (the controller), while the value function $J_t$, a single long-run cost value, comes from the critic neural network. $s_t$ is the input state vector at time $t$, represented by the tank levels. The reinforcement function $r_t$ is obtained from a linear quadratic equation. The backpropagation learning paths for the actor and critic networks are shown by dashed lines.
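A hedged sketch of such a quadratic reinforcement signal: it penalizes the deviation of the measured tank levels from their set points and, optionally, the control effort. The weighting matrices `Q` and `R_u` are illustrative assumptions; the paper does not specify them.

```python
import numpy as np

def reinforcement(s_t, s_ref, u_t, Q=None, R_u=None):
    """Quadratic reward r_t: closer to zero is better (assumed form)."""
    e = np.asarray(s_t) - np.asarray(s_ref)           # level tracking error
    u = np.asarray(u_t)
    Q = np.eye(e.size) if Q is None else Q            # state weight (assumed)
    R_u = np.zeros((u.size, u.size)) if R_u is None else R_u  # control weight
    return -(e @ Q @ e + u @ R_u @ u)
```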
As shown in Figure (1), the model produces a prediction of the next state and next reward.
HDP is used to solve Bellman's optimality equation which, according to Markov decision process principles, is written as [6]:

$$J^*(s_t) = \min_{u_t} \sum_{s_{t+1}} P(s_{t+1} \mid s_t, u_t)\left[r_t + \gamma J^*(s_{t+1})\right] \quad (1)$$

where $J^*(s_t)$ is the optimal value function of the current state $s_t$; $P(s_{t+1} \mid s_t, u_t)$ is the transition probability of moving to the next state $s_{t+1}$ under action $u_t$, which belongs to the admissible action set; and $\gamma$ is the discount factor, which is between 0 and 1. The optimal control is therefore given as:

$$u^*(s_t) = \arg\min_{u_t} \sum_{s_{t+1}} P(s_{t+1} \mid s_t, u_t)\left[r_t + \gamma J^*(s_{t+1})\right]$$

As shown in [6], DHDP consists of two blocks called the action network and the critic network, and it uses online learning for both neural networks. The control signal is generated by the actor neural network (the controller) and evaluated by the critic neural network. Both the critic and the actor have one hidden layer. The temporal-difference error for the critic network is defined as:

$$e_c(t) = \gamma J(t) - \left[J(t-1) - r(t)\right], \qquad E_c(t) = \tfrac{1}{2}\, e_c^2(t)$$

The gradient-based adaptation rule for the weight update in the critic network can be given by:

$$w_c(t+1) = w_c(t) + \Delta w_c(t), \qquad \Delta w_c(t) = l_c(t)\left[-\frac{\partial E_c(t)}{\partial w_c(t)}\right]$$

where $l_c(t)$ is the learning rate of the critic network at time $t$, and $w_c$ is the weight vector in the critic network. The weight update from the input layer to the hidden layer is:

$$\Delta w^{(1)}_{c,ij}(t) = l_c(t)\left[-\frac{\partial E_c(t)}{\partial w^{(1)}_{c,ij}(t)}\right], \qquad \frac{\partial E_c(t)}{\partial w^{(1)}_{c,ij}(t)} = \gamma\, e_c(t)\, w^{(2)}_{c,j}(t)\,\tfrac{1}{2}\left(1 - p_j^2(t)\right) x_i(t), \quad j = 1, \dots, N_h$$

where $N_h$ is the total number of hidden nodes in the critic network; $p_j(t)$ is the output of hidden node $j$; $\phi(v) = (1 - e^{-v})/(1 + e^{-v})$ is the sigmoid function, whose derivative yields the factor $\tfrac{1}{2}(1 - p_j^2(t))$; and $x(t)$ is the row vector of inputs to the critic network, consisting of the input states concatenated with the control signals. In matrix form, the derivative factor becomes $\tfrac{1}{2}\left(I - \operatorname{diag}(p^2(t))\right)$, where $I$ is the identity matrix and $\operatorname{diag}(p^2(t))$ is the diagonal matrix of squared hidden-node outputs.
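The critic update above can be made concrete in a few lines. The following is a minimal sketch, assuming a one-hidden-layer critic with the bipolar sigmoid of [6]; the layer sizes, learning rate, and initialization ranges are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def sigmoid(v):
    # bipolar sigmoid from [6]: maps to (-1, 1); derivative is 0.5 * (1 - phi^2)
    return (1.0 - np.exp(-v)) / (1.0 + np.exp(-v))

class Critic:
    """One-hidden-layer critic approximating the long-run cost J(t)."""
    def __init__(self, n_in, n_hidden, lc=0.05, gamma=0.95,
                 rng=np.random.default_rng(0)):
        self.W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))   # input -> hidden
        self.W2 = rng.uniform(-0.5, 0.5, n_hidden)           # hidden -> output
        self.lc, self.gamma = lc, gamma

    def forward(self, x):
        self.x = x                          # x = [states, control signals]
        self.p = sigmoid(self.W1 @ x)       # hidden-node outputs p_j(t)
        return self.W2 @ self.p             # scalar J(t)

    def update(self, J_t, J_prev, r_t):
        # TD error e_c(t) = gamma * J(t) - [J(t-1) - r(t)], then gradient step
        e_c = self.gamma * J_t - (J_prev - r_t)
        dphi = 0.5 * (1.0 - self.p**2)                      # sigmoid derivative
        self.W1 -= self.lc * e_c * self.gamma * np.outer(self.W2 * dphi, self.x)
        self.W2 -= self.lc * e_c * self.gamma * self.p
        return e_c
```

In the online loop, one would call `J_t = critic.forward(np.concatenate([s_t, u_t]))` at every step and then `critic.update(J_t, J_prev, r_t)`.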

As shown in Figure (1), the error between the desired ultimate objective $U_c(t)$ (see [2]) and the approximate value function $J(t)$ is backpropagated through the critic network. The error function of the action network can be defined as:

$$e_a(t) = J(t) - U_c(t)$$

Therefore, the objective function in the action network is $E_a(t) = \tfrac{1}{2}\, e_a^2(t)$. The weight update in the action network is given as follows:

$$w_a(t+1) = w_a(t) + \Delta w_a(t), \qquad \Delta w_a(t) = l_a(t)\left[-\frac{\partial E_a(t)}{\partial w_a(t)}\right]$$

where $l_a(t)$ is the learning rate of the action network at time $t$, and $w_a$ is the weight vector in the action network. A sketch of this update appears below.
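A matching sketch of the action-network update, under the same assumptions as the critic sketch above; `dJ_du`, the gradient of the critic's output with respect to the control inputs, is assumed to be obtained by backpropagating through the trained critic (the control entries of its input vector).

```python
import numpy as np

def sigmoid(v):
    # same bipolar sigmoid as in the critic sketch above
    return (1.0 - np.exp(-v)) / (1.0 + np.exp(-v))

class Actor:
    """One-hidden-layer actor mapping states s_t to bounded control voltages u_t."""
    def __init__(self, n_state, n_hidden, n_ctrl, la=0.05,
                 rng=np.random.default_rng(1)):
        self.W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_state))
        self.W2 = rng.uniform(-0.5, 0.5, (n_ctrl, n_hidden))
        self.la = la

    def forward(self, s):
        self.s = s
        self.h = sigmoid(self.W1 @ s)
        self.u = sigmoid(self.W2 @ self.h)    # bounded control signal u_t
        return self.u

    def update(self, J_t, Uc, dJ_du):
        # e_a(t) = J(t) - U_c(t); chain e_a * dJ/du back through the actor
        e_a = J_t - Uc
        delta_u = e_a * dJ_du * 0.5 * (1.0 - self.u**2)     # output-node deltas
        delta_h = (self.W2.T @ delta_u) * 0.5 * (1.0 - self.h**2)
        self.W2 -= self.la * np.outer(delta_u, self.h)
        self.W1 -= self.la * np.outer(delta_h, self.s)
```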
Two pumps are used to control the levels in the lower two tanks via the input voltages ($v_1$ and $v_2$). The voltages from the level-measurement devices represent the outputs ($y_1$ and $y_2$). Bernoulli's law and the mass-balance differential equations are given as follows [9]:

$$\frac{dh_1}{dt} = -\frac{a_1}{A_1}\sqrt{2gh_1} + \frac{a_3}{A_1}\sqrt{2gh_3} + \frac{\gamma_1 k_1}{A_1} v_1$$

$$\frac{dh_2}{dt} = -\frac{a_2}{A_2}\sqrt{2gh_2} + \frac{a_4}{A_2}\sqrt{2gh_4} + \frac{\gamma_2 k_2}{A_2} v_2$$

$$\frac{dh_3}{dt} = -\frac{a_3}{A_3}\sqrt{2gh_3} + \frac{(1-\gamma_2) k_2}{A_3} v_2$$

$$\frac{dh_4}{dt} = -\frac{a_4}{A_4}\sqrt{2gh_4} + \frac{(1-\gamma_1) k_1}{A_4} v_1$$

where $h_i$ is the level of tank $i$, $A_i$ and $a_i$ are the cross-sections of tank $i$ and of its outlet hole, $k_1, k_2$ are the pump constants, $g$ is the gravitational acceleration, and $\gamma_1, \gamma_2 \in (0, 1)$ are the valve flow-split ratios.
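For reference, a minimal Euler-integration sketch of these four ODEs; the physical parameter values below are placeholders from the standard benchmark configuration of [9], not the paper's own settings.

```python
import numpy as np

g = 981.0                                    # gravitational acceleration (cm/s^2)
A = np.array([28.0, 32.0, 28.0, 32.0])       # tank cross-sections (cm^2), assumed
a = np.array([0.071, 0.057, 0.071, 0.057])   # outlet hole areas (cm^2), assumed
k = np.array([3.33, 3.35])                   # pump constants (cm^3/(V*s)), assumed
gam = np.array([0.70, 0.60])                 # valve flow-split ratios, assumed

def qts_step(h, v, dt=0.1):
    """One Euler step of the four tank-level ODEs for pump voltages v = (v1, v2)."""
    q = a * np.sqrt(2.0 * g * np.maximum(h, 0.0))      # outflow of each tank
    dh = np.array([
        (-q[0] + q[2] + gam[0] * k[0] * v[0]) / A[0],  # tank 1: fed by tank 3 + pump 1
        (-q[1] + q[3] + gam[1] * k[1] * v[1]) / A[1],  # tank 2: fed by tank 4 + pump 2
        (-q[2] + (1 - gam[1]) * k[1] * v[1]) / A[2],   # tank 3: fed by pump 2 split
        (-q[3] + (1 - gam[0]) * k[0] * v[0]) / A[3],   # tank 4: fed by pump 1 split
    ])
    return np.maximum(h + dt * dh, 0.0)                # levels cannot go negative
```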

III. Plots and Discussion of Simulation Results
In this section, we compare the proportional-integral-derivative (PID) controller, as in [10], with our approach (DHDP). The PID gains appear in the PID transfer function:

$$C(s) = K_p + \frac{K_i}{s} + \frac{K_d\, N\, s}{s + N}$$

where $K_p$ is the proportional gain, $K_i$ is the integral gain, $K_d$ is the derivative gain, and $N$ is the first-order derivative-filter coefficient (for reducing noise and distortion). In this paper, we used two PID controllers (one for pump 1 and the other for pump 2). The gain values were taken from [10] and improved by trial and error, giving $K_p = 3$ and $K_i = 1$.
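A minimal discrete-time sketch of the filtered PID above; the reported gains $K_p = 3$ and $K_i = 1$ are used, while $K_d$, $N$, and the sample time are assumptions, since the paper does not state them.

```python
class FilteredPID:
    """Parallel PID with a first-order filter on the derivative term."""
    def __init__(self, kp=3.0, ki=1.0, kd=0.0, N=10.0, dt=0.1):
        self.kp, self.ki, self.kd, self.N, self.dt = kp, ki, kd, N, dt
        self.integral = 0.0
        self.d_state = 0.0       # state of the first-order derivative filter
        self.prev_err = 0.0

    def step(self, setpoint, measurement):
        err = setpoint - measurement
        self.integral += err * self.dt
        # raw derivative passed through the filter N / (s + N)
        raw_d = (err - self.prev_err) / self.dt
        self.d_state += self.N * (raw_d - self.d_state) * self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * self.d_state

# one controller per pump, as in the paper
pid1, pid2 = FilteredPID(), FilteredPID()
```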

IV. Technical and Economic Feasibility
The mean-squared error with the PID approach is 0.3849, while the mean-squared error with the HDP approach is 0.0029. The improvement percentage is therefore $(0.3849 - 0.0029)/0.3849 \times 100\% \approx 99.2466\%$, which yields a highly efficient use of electrical power. Although the HDP approach gives better and more reliable results, it requires building two neural networks and a high-speed computer for training and learning the critic and actor networks.

This paper has presented DHDP for controlling a well-known device used in the oil and gas industry, the QTS. The performance of HDP over time was excellent compared to the PID controller. Merging neural networks with the oil and gas field improves the generalization ability of the system in dealing with dynamic changes in the environment. A significant advantage in boosting the efficiency of tank-level control is demonstrated in this paper.