1. Introduction
The Internet has become an essential part of modern society, and people around the world use it for daily activities, e.g. communication, shopping, and research. The growing number of Internet applications, the large volume of exchanged information, and the diversity of services attract the attention of people wishing to gain unauthorized access to these resources, e.g. hackers, attackers, and spammers. These threats highlight the critical role of information security technologies in protecting resources and limiting access to authorized users only. To this end, various preventive and protective security mechanisms, policies, and technologies have been designed [1,2].
Approximately a decade ago, the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) was introduced by von Ahn et al. [3] as a challenge-response authentication measure. The challenge, as illustrated in Figure 1, is an example of Human Interaction Proofs (HIPs) used to differentiate between computer programs and human users. CAPTCHA methods are often designed around open and hard Artificial Intelligence (AI) problems that are easily solvable by human users [4].
As attempts at unauthorized access increase every day, CAPTCHAs are needed to prevent bots from disrupting services such as subscription, registration, and account/password recovery, from running attacks such as blog spamming, search engine attacks, dictionary attacks, and email worms, and to block scrapers. Online polling, survey systems, ticket booking, and e-commerce platforms are the main targets of bots [5].
The critical usability issues of early CAPTCHA methods pushed cybersecurity researchers to evolve the technology towards alternative solutions that alleviate the intrinsic inconvenience with a better user experience, more engaging designs (i.e. gamification), and support for users with disabilities. The emergence of powerful mobile devices motivated researchers to design gesture-based CAPTCHAs that combine cognitive, psychomotor, and physical activities within a single task. This class is robust, user-friendly, and suitable for people with disabilities such as hearing and, in some cases, vision impairment. A gesture-based CAPTCHA works on the principles of image processing to detect hand motions. It analyzes low-level features such as color and edges to deliver human-level capabilities in analysis, classification, and content recognition.
Recently, outstanding advances have been made in hand gesture recognition. Image processing techniques perform segmentation and feature extraction on the input images and detect the hand pose, fingertips, and palm in real time. During the past twenty years, researchers used colored or wired gloves equipped with sensors to detect the hand pose. Skin-color-based analysis, however, requires no gloves or attached sensors to detect the fingers, and it works fast, accurately, and in real time.
To detect the hand gesture, the algorithm should first transform the video frames into two-dimensional images and then apply segmentation and skin filter functions. In this area, researchers have presented various methods that utilize machine learning techniques such as Support Vector Machines (SVM) [6], Artificial Neural Networks (ANN) [7], fuzzy systems [8], Deep Learning (DL) [9], and metaheuristic methods [10].
2. Background and Related Works
The increasing popularity of the web has made it an attractive ground for hackers and spammers, especially where financial transactions or user information are involved [1]. CAPTCHAs are used to prevent bots from accessing resources designed for human use, e.g. login/sign-up procedures, online forms, and even user authentication [11]. The general applications of CAPTCHAs include preventing comment spam, harvesting of plaintext email addresses, and dictionary attacks, and protecting website registration, online polls, e-commerce, and social network interactions [1].
Based on the underlying technologies, CAPTCHAs can be broadly classified into three classes: visual, non-visual, and miscellaneous [1]. Each class may be further divided into several sub-classes, as illustrated in Figure 2.
The first class is the visual CAPTCHA, which challenges users on their ability to recognize visual content. Despite its usability issues for visually impaired people, this class is straightforward and easy to implement.
Text-based or Optical Character Recognition (OCR)-based CAPTCHA is the first sub-class of visual CAPTCHAs. While the task is easy for a human, it challenges automatic character recognition programs by distorting a combination of numbers and letters. Division, deformation, rotation, color variation, size variation, and distortion are some of the techniques used to make the text unrecognizable to a machine [12]. BaffleText [12], Pessimal Print [13], Gimpy and EZ-Gimpy [14], and ScatterType [15] are some of the proposed methods in this area. Google reCAPTCHA v2 [16], Microsoft Office, and Yahoo! Mail are famous examples of industry-designed CAPTCHAs [17]. Bursztein et al. [18] designed a model to measure the security of text-based CAPTCHAs; they identified several areas of improvement and designed a generic solution.
Due to the complexity of object recognition tasks, the second sub-class of visual CAPTCHAs is designed on the principle of the semantic gap [19]: people can understand more from a picture than a computer can. Among the various image-based CAPTCHAs, ESP-PIX [3], IMAGINATION [20], Bongo [3], ARTiFACIAL [21], and Asirra [22] are the more advanced ones. Image-based CAPTCHAs have two variations, 2D and 3D [23]. In fact, the general design of the methods in this sub-class is 2D; however, some methods utilize 3D images (of characters), as humans are better at recognizing 3D images. Another type of image-based CAPTCHA is a puzzle game designed by Gao et al. [24]. The authors used variations in color, orientation, and size, and they could successfully enhance security. They showed that puzzle-based CAPTCHAs need less solving time than text-based ones. An extended version of [24] that uses animation effects was later developed in [25].
Moving-object CAPTCHA is a relatively new trend that shows a video and asks users to type what they have perceived or seen [26]. Despite being secure, this method is very complicated and expensive compared to the other solutions.
The last sub-class of visual CAPTCHAs, the interactive CAPTCHA, tries to increase user interaction to enhance security. This method requires more mouse clicks, drags, and similar activities; some variations request the user to type the perceived or identified response. Except for users with certain disabilities, solving the challenges is easy for humans but hard for computers. The methods designed by Winter-Hjelm et al. [27] and Shirali-Shahreza et al. [28] are two examples of interactive CAPTCHAs.
The second CAPTCHA class contains non-visual methods in which the user is assessed through audio or semantic tests. Audio-based methods are subject to speech recognition attacks, while semantic-based methods are very secure and much harder for computers to crack. Moreover, semantic-based methods are relatively easy for users, even those with hearing or visual impairments. However, semantic methods might be very challenging for users with cognitive deficiencies [1].
The first audio-based CAPTCHA method was introduced by Chan [29], and later other researchers such as Holman et al. [30] and Schlaikjer [31] proposed enhanced methods. A relatively advanced approach in this area is to ask the user to repeat a played sentence; the recorded response is then analyzed by a computer to check whether the speech was produced by a human or a speech synthesis system [28]. Limitations of this method include requiring equipment such as a microphone and speakers and being difficult for people with hearing and speech disabilities.
Semantic CAPTCHAs are a relatively secure sub-class, as computers are far behind humans in answering semantic questions. However, they are still vulnerable to attacks that use computational knowledge engines, e.g. search engines or Wolfram Alpha. The implementation cost of semantic methods is very low, as they are normally presented in plaintext format [1]. The works by Lupkowski et al. [32] and Yamamoto et al. [33] are examples of semantic CAPTCHAs.
The third CAPTCHA class utilizes diverse technologies, sometimes in conjunction with visual or audio methods, to present techniques with novel ideas and extended features. Google's reCAPTCHA is a free service [16] for safeguarding websites from spam and unauthorized access. It works based on adaptive challenges and an advanced risk analysis system to prevent bots from accessing web resources. In reCAPTCHA, the user first needs to click on a checkbox; if this check fails, the user then needs to select specific images from a set of given images. In 2018, Google released the third version, reCAPTCHA v3 or NoCAPTCHA [34]. This version monitors the user's activities and reports the probability of the user being human or robot without requiring a click on the “I'm not a robot” checkbox. However, NoCAPTCHA is vulnerable to some technical issues such as erased cookies, blocked JavaScript, and incognito web browsing.
Yadava et al. [35] designed a method that displays the CAPTCHA for a fixed time and refreshes it until the user enters the correct answer. The refresh only changes the CAPTCHA, not the page, and the time limit makes cracking it harder for bots.
Wang et al. [36] put forward a graphical password scheme based on CAPTCHAs. The authors claimed it can strengthen security by enhancing the capability to resist brute-force attacks, withstanding spyware, and reducing the size of the password space.
Solve Media presented an advertisement-based CAPTCHA in which the user must enter a response to a shown advertisement to pass the challenge. This idea can be extended to different areas, e.g. education or culture [1].
A recent trend is to design personalized CAPTCHAs that are specific to users based on their conditions and characteristics. In this approach, one important aspect is identifying the user's cognitive capabilities, which must be integrated into the CAPTCHA design process [37]. Another aspect is personalizing the content by considering factors such as geographic location to prevent third-party attacks. As an example, Wei et al. [38] developed a geographic scene image CAPTCHA that combines Google Maps information with an image-based CAPTCHA to challenge the user with information that is privately known.
Solving a CAPTCHA on the small screen of a smartphone might be a hassle for users. Jiang et al. [39] found that wrong touches on the screen of a mobile phone decrease CAPTCHA performance. Analyzing the user's finger movements in the back end of the application can help differentiate between a bot and a human [40]; for example, by analyzing the sub-image dragging pattern, the method in [40] can detect whether an action is “BOTish” or “HUMANish”. Some similar approaches [41,42] challenge the user to find segmentation points in cursive words.
In the area of Hand Gesture Recognition (HGR), early solutions used data gloves, hand belts, and cameras. Luzhnica et al. [43] used a hand belt equipped with a gyroscope, an accelerometer, and Bluetooth for HGR. Hung et al. [44] acquired the required input data from hand gloves. Another study used Euclidean distance to analyze twenty-five different hand gestures and employed an SVM for classification and control tasks [45]. In another effort [46], the researchers converted the captured Red, Green, and Blue (RGB) images to grayscale, applied a Gaussian filter for noise reduction, and fed the results to a classifier to detect the hand gesture.
Chaudhary et al. [47] used a normal Windows-based webcam to record user gestures. Their method extracts the Region of Interest (ROI) from the frames and applies a Hue, Saturation, Value (HSV)-based skin filter to the RGB images under particular illumination conditions. To help the fingertip detection process, the method analyzes hand direction according to pixel density in the image. It also detects the palm based on the number of pixels located in a 30×30-pixel mask on the cropped ROI image. The method was implemented with a supervised neural network based on the Levenberg–Marquardt algorithm, trained on eight thousand samples covering all fingers. The network takes five inputs, one per finger pose, and produces five outputs, the bend angles of the fingers. Marium et al. [48] extracted the hand gesture, palm, fingers, and fingertips from webcam videos using the functions and connectors in OpenCV.
Simion et al. [49] researched finger pose detection for mouse control. Their method first detects the fingertips (except the thumb) and the palm. The second step calculates the distances between the pointing and middle fingers and the center of the palm, and the distance between the tips of the pointing and middle fingers. The third step computes the angle between the two fingers, the angle between the pointing finger and the x-axis, and the angle between the middle finger and the x-axis. The researchers used six hand poses for mouse operations including right/left click, double click, and right/left movement.
The main contribution of this paper is the development of a novel hand-gesture-controlled CAPTCHA for mobile devices that is easy to use even for people with visual challenges. Choosing a set of easy-to-imitate hand gestures enhances the usability of the method and minimizes the number of reattempts. The originality of the method is extended by detecting the bend angles of the fingers through applying Genetic Algorithm (GA) principles to a Multi-Layer Perceptron (MLP). The steps include detecting the palm center and fingertips in real time, measuring the distance between the palm center and each fingertip, calculating the bend angle of each finger from these distances, and recognizing the hand gesture accordingly.
The remainder of this paper is organized as follows. Section 3 reviews the structure of the proposed method, including skin-color-based hand detection and the detection of the hand pose, palm, fingertips, and fingers' bend angles. Section 4 covers implementation and experiments. Section 5 presents analysis, comparison, and discussion, and Section 6 concludes the work.
3. The Proposed Method
Gestural CAPTCHA, hereafter called GAPTCHA, shows a set of simple hand poses to the user and then requests the user to imitate these gestures in front of the camera. As illustrated in Figure 3, the essence of recognizing the gestures in this method is calculating the fingers' bend angles. The method preprocesses the input 2D image and applies a skin filter to it. The employed segmentation method extracts the hand pose from the image, even if the background contains skin-colored objects, and focuses only on the extracted areas of interest for faster and more accurate analysis. The method first detects the fingertips and palm in the segmented image and then calculates the distance between the center of the palm and each fingertip. From these distances it calculates the bend angles of the fingers, detecting the fingers, palm, and angles without errors. The proposed method works with a normal-quality webcam for capturing the user's hand movements; no marked gloves, wearable sensors, or even a long sleeve is required. It also places no limitation on the camera angle.
The most common segmentation basis in HGR systems is skin color, as it is invariant to changes in size, rotation, and movement. The captured webcam images are in the RGB color system, whose parameters are highly correlated and sensitive to illumination. The HSV system, in contrast, separates color from illumination information. Therefore, in skin-color-based HGR systems, the captured RGB images are first converted to HSV and then segmented to generate a binary version of the image. Each pixel of a binary image is stored in one bit, and due to skin-colored pixels in the background, these images normally carry some noise. As shown in Figure 4, this noise creates unwanted spots in the image. A median filter removes the isolated spots, and a dilation operator then fills the regions.
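As an illustration, the following is a minimal Python/OpenCV sketch of this preprocessing pipeline. The HSV threshold values and kernel sizes are illustrative assumptions, not GAPTCHA's tuned parameters:

```python
import cv2
import numpy as np

def segment_skin(frame_bgr):
    """Convert a captured frame into a denoised binary skin mask."""
    # OpenCV delivers frames in BGR order; convert to HSV so that
    # color (hue/saturation) is separated from illumination (value).
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)

    # Illustrative skin-color range; the paper's exact thresholds may differ.
    lower = np.array([0, 40, 60], dtype=np.uint8)
    upper = np.array([25, 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)  # binary image (0 or 255)

    # Remove isolated noise spots with a median filter, then fill
    # small gaps in the hand region with a dilation operator.
    mask = cv2.medianBlur(mask, 5)
    mask = cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=2)
    return mask
```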
The proposed method supports freedom of angle for both the camera and the user; the only requirement is that the palm faces the camera. The method uses contours to extract the hand from the binary image. A contour is a list of points representing a curve in the image, and its main application is in the analysis and detection of shapes. Here, the FindContours function in OpenCV is used to find the largest contour in the binary image, i.e. the hand shape shown in Figure 5, section A. Finding the center of this contour helps locate the center of the palm (Figure 5, section B). To this end, using the BoundingRect function in OpenCV, a bounding box is applied to the hand shape to find the palm center.
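A minimal sketch of this step follows, under the assumption that the palm center is approximated by the center of the bounding box; the helper name find_hand is introduced here for illustration:

```python
import cv2

def find_hand(mask):
    """Locate the hand as the largest contour and estimate the palm center."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, None
    hand = max(contours, key=cv2.contourArea)  # largest contour = hand shape

    # Bounding box around the hand; its center approximates the palm center.
    x, y, w, h = cv2.boundingRect(hand)
    palm_center = (x + w // 2, y + h // 2)
    return hand, palm_center
```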
The ConvexHull algorithm is employed to identify the fingertips. It returns a set of polygons in which the corners of the largest one represent the fingertips (Figure 5, section C). To automate the process, the ConvexityDefects function approximates the gaps between the contour and the polygon by straight lines. The output of this function is a set of records with four fields: (i) the starting defect point, (ii) the ending defect point, (iii) the middle (farthest) defect point connecting the starting and ending points, and (iv) the approximate distance to the farthest point. A sample output of this step is shown in Figure 5, section D. Each record yields two lines: one from the starting point to the middle point and one from the middle point to the ending point. However, the function may return more points than the number of fingertips. Filtering the detected points to find the correct fingertip locations involves (i) calculating the internal angle between two defect areas within a certain range, (ii) calculating the angle between the starting point and the contour center within a certain range, and (iii) requiring the line length to remain below a defined threshold. The green points in Figure 5, section E represent the results of this step.
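The defect-extraction step could be sketched as follows. The four-field record layout matches OpenCV's convexityDefects output; the helper name defect_candidates is introduced here for illustration:

```python
import cv2

def defect_candidates(hand):
    """Extract fingertip candidates from the convexity defects of the hand contour."""
    # convexityDefects requires the hull as point *indices*, not coordinates.
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    candidates = []
    if defects is None:
        return candidates
    for i in range(defects.shape[0]):
        # Each record: start index, end index, farthest (middle) index,
        # and the approximate distance to the farthest point.
        s, e, f, d = defects[i, 0]
        start = tuple(hand[s][0])
        end = tuple(hand[e][0])
        middle = tuple(hand[f][0])
        candidates.append((start, middle, end, d / 256.0))  # d is fixed-point
    return candidates
```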
Eqs. (1), (2), and (3) are utilized to calculate the lengths of the vectors produced from the start point $P_s = (x_s, y_s)$, the middle point $P_m = (x_m, y_m)$, and the end point $P_e = (x_e, y_e)$, and Eq. (4) computes the angle:

$$a = \sqrt{(x_s - x_m)^2 + (y_s - y_m)^2} \tag{1}$$
$$b = \sqrt{(x_e - x_m)^2 + (y_e - y_m)^2} \tag{2}$$
$$c = \sqrt{(x_s - x_e)^2 + (y_s - y_e)^2} \tag{3}$$
$$\theta = \arccos\!\left(\frac{a^2 + b^2 - c^2}{2ab}\right) \tag{4}$$

Here $a$, $b$, and $c$ are the lengths of the vectors formed from the points returned by the convexityDefects function, and $\theta$ is the internal angle at the middle (farthest) defect point.
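Under this reconstruction, the computation maps directly to code; a minimal sketch:

```python
import math

def defect_angle(start, middle, end):
    """Internal angle at the middle (farthest) defect point, per Eqs. (1)-(4)."""
    a = math.dist(start, middle)  # Eq. (1)
    b = math.dist(end, middle)    # Eq. (2)
    c = math.dist(start, end)     # Eq. (3)
    # Eq. (4): law of cosines for the angle between the two defect lines.
    return math.degrees(math.acos((a**2 + b**2 - c**2) / (2 * a * b)))
```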
Calculating the angle between the first point of the defect area and the center of the contour is an essential step for removing non-fingertip areas; the subsequent step filters the points to those between -30 and +160 degrees. Eq. (5) returns the angle between the first point of the defect area and the center of the contour, and Eq. (6) calculates the Euclidean distance, i.e. the vector length, between the first and middle points:

$$\alpha = \operatorname{atan2}(y_s - y_{center},\, x_s - x_{center}) \tag{5}$$
$$d = \sqrt{(x_s - x_m)^2 + (y_s - y_m)^2} \tag{6}$$

In these equations, $center$ represents the center of the contour, i.e. the center of the palm. The whole process, from finding the hand contour to detecting the palm and fingertips, is illustrated in Figure 5.
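Combining Eqs. (4)-(6), the filtering step might be sketched as follows, reusing defect_angle from the previous sketch. The internal-angle cutoff of 90 degrees and the helper names are assumptions for illustration; note also that image coordinates have the y-axis pointing down, which flips the sign of the angle relative to the mathematical convention:

```python
import math

def fingertip_points(candidates, center, max_len):
    """Filter defect candidates down to fingertip locations (Eqs. 4-6)."""
    tips = []
    for start, middle, end, _ in candidates:
        # Eq. (5): orientation of the start point relative to the palm center.
        alpha = math.degrees(math.atan2(start[1] - center[1],
                                        start[0] - center[0]))
        # Eq. (6): Euclidean distance from the start to the middle point.
        length = math.dist(start, middle)
        # Keep points within the accepted angular range, with a short enough
        # line and a sharp enough internal angle (assumed cutoff: 90 degrees).
        if (-30 <= alpha <= 160 and length < max_len
                and defect_angle(start, middle, end) < 90):
            tips.append(start)
    return tips
```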
As already explained, this research uses a combination of an MLP and a GA to predict the bend angles of the fingers. The MLP has a flexible structure in which neurons are located in hidden layers, and each neuron transforms its input to produce the desired output: a weight is applied to each neuron's input value, and the result, together with a bias value, passes into an activation function. Reducing classification errors and correctly predicting the pose in an artificial neural network depends on selecting optimal weight and bias values. Figure 6 illustrates the detailed structure of the proposed method for applying GA principles to the MLP to accurately detect the fingers' bend angles and, accordingly, recognize the hand gesture.
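For reference, here is a minimal sketch of the forward pass whose weights and biases the GA tunes; the sigmoid activation and the normalized-angle outputs are assumptions, as the exact activation function is not stated here:

```python
import numpy as np

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: weighted inputs plus bias through an activation."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(w_hidden @ x + b_hidden)  # hidden layer
    return sigmoid(w_out @ h + b_out)     # predicted bend angles (normalized)
```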
In GAPTCHA, each chromosome encodes the weights and biases of the neural network. A chromosome can be shown in the format $[w_{ij}, w_{jo}, \theta_j, \theta_o]$, in which $w_{ij}$ is the weight of the $i$th entry into the $j$th neuron of the hidden layer, $w_{jo}$ is the corresponding weight of the $j$th hidden neuron into the output, and $\theta_j$ and $\theta_o$ are the threshold values of the hidden and output layers, respectively. Upon defining the encoding system and the method of converting each answer to a chromosome, the next step is to produce the initial chromosome population. Generating the initial population is normally a random process; however, heuristic algorithms can accelerate and optimize it. GAPTCHA uses a roulette wheel mechanism for selecting parents in the mutation and crossover processes. In this mechanism, the probability of selecting a chromosome depends on how well it scores on the evaluation function; in other words, the higher the quality of a chromosome, the higher its chance of being selected to produce the next generation, and vice versa. Eq. (7) calculates the chance of selecting a chromosome in the roulette wheel.
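In standard fitness-proportionate form, with $f(c_i)$ denoting the fitness value of chromosome $c_i$ and $N$ the population size, Eq. (7) can be written as:

$$p_i = \frac{f(c_i)}{\sum_{j=1}^{N} f(c_j)} \tag{7}$$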
In the above equation, the probability $p_i$ of selecting the $i$th chromosome is the proportion of the fitness value of chromosome $i$ to the sum of the fitness values of all chromosomes. Single-point crossover is used in this method: it chooses a random point in the parent chromosomes and swaps the information in the remaining parts. The mutation operator is then applied through the following steps (see the sketch after this list):
- i.
Choose a random value between 0 and 1,
- ii.
If this value is larger than the mutation threshold (a value between 0 and 1), go to the next step and mutate; otherwise skip mutation,
- iii.
Choose a random number that indicates one of the chromosome's genes and apply a numerical mutation to it.
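A compact sketch of these GA operators follows, assuming each chromosome is a flat list of MLP weights and biases; the Gaussian perturbation and its scale are illustrative choices, not parameters given in the text:

```python
import random

def roulette_select(population, fitness):
    """Fitness-proportionate (roulette wheel) parent selection, per Eq. (7)."""
    total = sum(fitness)
    pick = random.uniform(0, total)
    acc = 0.0
    for chrom, fit in zip(population, fitness):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]

def single_point_crossover(parent_a, parent_b):
    """Cut both parents at one random point and swap the remaining parts."""
    point = random.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chrom, threshold, scale=0.1):
    """Numerical mutation of one randomly chosen gene (weight or bias)."""
    if random.random() > threshold:          # step (ii): mutate only above threshold
        idx = random.randrange(len(chrom))   # step (iii): pick a random gene
        chrom = list(chrom)
        chrom[idx] += random.gauss(0.0, scale)  # assumed numerical perturbation
    return chrom
```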