I am documenting my almost three-year journey to build a usable question-answering cyber security bot. I had begun dabbling with robotics and artificial intelligence, and one of the first things I decided to build was a chatbot.
I did not set out to emulate any of the existing voice assistants. I wanted to build a cyber security chatbot to showcase my skills and get a handle on text processing.
It took three years because life happened and I took my time over the bot. There were stretches when the bot languished.
My bot criteria
- I did not want to use any of the free platforms, because my bot would then be under the control of someone else.
- I wanted full control over the content.
- I should be able to explain to someone how the bot works.
- The database of the bot should be easy to update.
- I should be able to scale the bot.
The data set
The first thing I had to do was to get a data set of questions. I wanted the bot to emulate a chief information security officer. I was unable to find a list of commonly asked questions but I did find a list of interview questions. They became my first entries in the database of the bot.
The next set of data was common information security vocabulary. A chief information security officer should know the meanings of common security terms.
I took a Glossary of Security Terms, adapted it to my needs and loaded it into the database.
The artificial intelligence markup language
The next problem was the platform. How do I build the bot? That is when I discovered the Artificial Intelligence Markup Language, also known as AIML.
I was fortunate in that I could also use my writing skills here. Building the bot in AIML did involve some technical work, but it was more about writing because I had to think about how a single question could be asked in several different ways.
I quickly realized that AIML 1.0 would not do the job for me. I needed more features, and that is when I found AIML version 2. At that time, it was under development and not a full standard. I had to find an interpreter that would support it.
Enter Program-Y
I settled on Program-Y because I found it the easiest to use. I could use existing AIML, such as that for insult detection, and did not need to write any adapters as I had to with ChatterBot.
I may revisit ChatterBot because, in researching for this article, I found that it seems to have become easier to use, and I have learnt more than I knew at that time.
Here is some of the basic code I put together while experimenting with the library.
from chatterbot import ChatBot
import logging

logging.basicConfig(level=logging.INFO)

chatbot = ChatBot(
    'CISO',
    storage_adapter='chatterbot.storage.SQLStorageAdapter',
    database='./database.sqlite3',
    trainer='chatterbot.trainers.ChatterBotCorpusTrainer',
    preprocessors=[
        'chatterbot.preprocessors.clean_whitespace'
    ],
    logic_adapters=[
        {
            'import_path': 'chatterbot.logic.BestMatch'
        },
        {
            'import_path': 'chatterbot.logic.LowConfidenceAdapter',
            'threshold': 0.65,
            'default_response': 'unknown.'
        }
    ]
)

#chatbot.train("/home/pi")

# Get a response to an input statement
resp = chatbot.get_response("what is cyber security")
print(resp)
Safety
I wanted the bot to be as safe as possible. I wanted it to handle as many unusual inputs as I could think of. People frequently insult bots, and I wanted to catch this use case if nothing else. This is where I used the freely available insult AIML from one of the initial AIML bots made by Dr. Richard Wallace. This file had a surprising array of insults which I was able to import directly into the bot.
The unknown question
As the bot grew, my ambitions grew too. I next wanted the bot to handle questions it did not know. It was too easy to say something like “I’ll contact my bot master for more information.” That is when I discovered the ability to add web services in Program-Y to query DuckDuckGo. DuckDuckGo was the only search engine that gave me a free API for this kind of task.
Restricting the domain to information security
At this point, I had a bot, but the bot would answer any question. I could ask it “what is cooking” and it would get the answer from DuckDuckGo and tell me. Yes, I am sure CISOs could be good cooks, but I wanted to restrict the bot. The only way I could think of doing that was to do a vocabulary check. If the input contained information security words, then the bot would continue and search for an answer.
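A minimal sketch of such a vocabulary check is shown here. The term list and the function name are hypothetical and only for illustration; the real bot would draw on its own glossary, and this version matches single words only.

```python
import re

# Hypothetical sample vocabulary; a real bot would load its own glossary terms.
SECURITY_TERMS = {
    "firewall", "malware", "phishing", "encryption", "password", "vulnerability",
}


def is_security_question(text, vocabulary=SECURITY_TERMS):
    """Return True if any single word of the input appears in the vocabulary."""
    # Lowercase the input and split it into simple word tokens.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return any(word in vocabulary for word in words)


print(is_security_question("What is phishing?"))  # True
print(is_security_question("What is cooking?"))   # False
```

A set lookup keeps the check cheap. Multi-word terms such as "two factor authentication" would need phrase matching on top of this.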
Detecting questions
My bot was not equipped to handle statements. People frequently type statements where a question is expected, and I did not have a mechanism to handle this situation. This is where I began to get into natural language processing. I used the Natural Language Toolkit (NLTK) to check what parts of speech constituted my ideal question. I then set the bot to detect that pattern and to ask the user to enter only questions.
Here is the code.
import logging
import difflib

import nltk
from nltk.tag import pos_tag, map_tag
from programy.extensions.base import Extension


class CheckIsQuestion(Extension):

    def isQuestion(self, q):
        # Tag each token, then map the Penn Treebank tags to the universal tagset
        text = nltk.word_tokenize(q)
        posTagged = pos_tag(text)
        simplifiedTags = [(word, map_tag('en-ptb', 'universal', tag)) for word, tag in posTagged]
        onlytags = [tag for _, tag in simplifiedTags]
        logging.debug("computing done")

        # Tag sequences that count as a question
        sqp = [["pron", "verb", "noun"], ["pron", "verb", "det", "noun"]]
        for m in sqp:
            qm = difflib.SequenceMatcher(None, str(onlytags)[1:].lower(), str(m).lower())
            # real_quick_ratio() is only a fast upper bound on the similarity
            r = qm.real_quick_ratio() * 100
            if r > 97:
                return True
        return False

    def execute(self, bot, clientid, data):
        if logging.getLogger().isEnabledFor(logging.DEBUG):
            logging.debug("question detected")
        if self.isQuestion(data):
            result = "question found"
        else:
            result = "question not found"
        return result + " " + data
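One thing worth knowing about the extension above is that real_quick_ratio() only returns a fast upper bound computed from the lengths of the two strings, so it can accept tag sequences that differ. A stricter variant, sketched here under the assumption that the tags have already been produced by a tagger, compares the tag lists element by element with ratio(). The function name and the threshold are my own illustrative choices and are not part of Program-Y or NLTK.

```python
from difflib import SequenceMatcher

# Tag patterns taken from the extension above.
QUESTION_PATTERNS = [
    ["PRON", "VERB", "NOUN"],
    ["PRON", "VERB", "DET", "NOUN"],
]


def matches_question_pattern(tags, threshold=0.97):
    """Return True if the tag list closely matches one of the patterns.

    Hypothetical helper: `tags` is a list of universal part-of-speech tags
    that some tagger has already produced.
    """
    normalized = [tag.upper() for tag in tags]
    for pattern in QUESTION_PATTERNS:
        # ratio() compares the two lists element by element.
        score = SequenceMatcher(None, normalized, pattern).ratio()
        if score >= threshold:
            return True
    return False


# Hand-written tag lists for illustration, not real tagger output.
print(matches_question_pattern(["PRON", "VERB", "NOUN"]))  # True
print(matches_question_pattern(["NOUN", "VERB"]))          # False
```

Comparing lists instead of their string forms also removes the need to slice off the leading bracket.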
Deploying the bot
I was finally ready to publish the bot. However, I ran into an issue. The Program-Y environment only exposed itself on localhost. The answer was to proxy it via nginx.
The code
Enough theory for now, so what does the code look like?
Let me start with the Python code to look up an unknown answer at DuckDuckGo. The primary brain of the bot references this code.
import logging
import json

from programy.services.service import Service
from programy.services.requestsapi import RequestsAPI


class DuckDuckGoAPI(object):

    def __init__(self, request_api=None):
        if request_api is None:
            self._requests_api = RequestsAPI()
        else:
            self._requests_api = request_api

    # Provide a summary of a single article
    def ask_question(self, url, question, num_responses=1):
        # http://api.duckduckgo.com/?q=DuckDuckGo&format=json
        payload = {'q': question, 'format': 'json'}
        response = self._requests_api.get(url, params=payload)

        if response is None:
            raise Exception("No response from DuckDuckGo service")

        if response.status_code != 200:
            raise Exception("Error response from DuckDuckGo service [%d]" % response.status_code)

        json_data = json.loads(response.text)

        if 'RelatedTopics' not in json_data:
            raise Exception("Invalid response from DuckDuckGo service, 'RelatedTopics' missing from payload")

        topics = json_data['RelatedTopics']
        if len(topics) == 0:
            raise Exception("Invalid response from DuckDuckGo service, no topics in payload")

        if len(topics) < num_responses:
            num_responses = len(topics)

        # Keep only the first sentence of each topic
        responses = []
        for i in range(num_responses):
            if 'Text' in topics[i]:
                sentences = topics[i]['Text'].split(".")
                responses.append(sentences[0])

        return ". ".join(responses)


class DuckDuckGoService(Service):

    def __init__(self, config=None, api=None):
        Service.__init__(self, config)

        if api is None:
            self._api = DuckDuckGoAPI()
        else:
            self._api = api

        self._url = None
        if config.url is None:
            raise Exception("Undefined url parameter")
        else:
            self._url = config.url

    def ask_question(self, bot, clientid: str, question: str):
        try:
            return self._api.ask_question(self._url, question)
        except Exception as e:
            if logging.getLogger().isEnabledFor(logging.ERROR):
                logging.error("General error querying DuckDuckGo for question [%s] - [%s]" % (question, str(e)))
            return ""
Do read the Program-Y documentation for details on how this service is constructed.
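To see what the service code above does with a response, here is a standalone sketch of the first-sentence extraction that runs without Program-Y or a network connection. The payload is made up and only mirrors the fields the service reads ('RelatedTopics' and 'Text'); it is not real DuckDuckGo output, and the function name is hypothetical. Unlike the service, the sketch returns an empty string instead of raising when there are no topics.

```python
import json

# Made-up payload with the same field names the service code reads.
SAMPLE_RESPONSE = json.dumps({
    "RelatedTopics": [
        {"Text": "A firewall filters network traffic. It can be hardware or software."},
        {"Name": "An entry without a Text field"},
        {"Text": "A proxy relays requests. It sits between client and server."},
    ]
})


def first_sentences(response_text, num_responses=1):
    """Return the first sentence of up to num_responses topics, joined by '. '."""
    topics = json.loads(response_text).get("RelatedTopics", [])
    responses = []
    for topic in topics[:num_responses]:
        # Entries without a 'Text' field are skipped, as in the service code.
        if "Text" in topic:
            responses.append(topic["Text"].split(".")[0])
    return ". ".join(responses)


print(first_sentences(SAMPLE_RESPONSE))
# A firewall filters network traffic
print(first_sentences(SAMPLE_RESPONSE, num_responses=3))
# A firewall filters network traffic. A proxy relays requests
```

Splitting on the full stop is crude, since an abbreviation would cut a sentence short, but it keeps the bot's replies brief.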
We now come to the brain of the bot.
This file helps catch insults.
Be warned, actual insults are included here.
My logic is simple. All insults map to a single template where I output a standard response. I have distinguished between profanity and insults to cater for a future enhancement where I may need to differentiate between being rude and being abusive, but as of this writing, this has not been necessary.
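In Python terms, the redirect chain in the file that follows behaves roughly like the sketch here. This is only an analogy for how the srai redirects funnel many patterns into one response. It handles exact matches only and no wildcards, the function name is hypothetical, and the reply text is invented for illustration.

```python
import re

# Each pattern redirects to another pattern, as <srai> does in the AIML file.
REDIRECTS = {
    "YOU IDIOT": "FILTER INSULT",
    "TELL ME A DIRTY JOKE": "FILTER INAPPROPRIATE",
    "FILTER INSULT": "FILTER PROFANITY",
    "FILTER INAPPROPRIATE": "FILTER PROFANITY",
}

# Invented reply text; the real template is not shown in this article.
RESPONSES = {
    "FILTER PROFANITY": "Please keep the conversation civil.",
}


def respond(user_input, max_hops=10):
    """Follow redirects until a response is found, or return None."""
    # Normalize roughly as an AIML interpreter would: uppercase, no punctuation.
    key = " ".join(re.sub(r"[^A-Za-z0-9 ]", " ", user_input).upper().split())
    for _ in range(max_hops):
        if key in RESPONSES:
            return RESPONSES[key]
        if key not in REDIRECTS:
            return None
        key = REDIRECTS[key]
    return None


print(respond("you idiot!"))          # Please keep the conversation civil.
print(respond("what is a firewall"))  # None
```

Because every category ends at the same key, changing the standard response means editing one entry rather than hundreds.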
<?xml version="1.0" encoding="UTF-8"?>
<aiml>
<!-- File: ciso.aiml -->
<!-- Author: Mr. Pranav Lal -->
<!-- Last modified: September 23, 2016 -->
<!-- -->
<!-- This AIML file is part of the CISO 1.0 chat bot knowledge base. -->
<!-- -->
<!-- The CISO brain is Copyright © 2016 by security-writer. -->
<!-- -->
<!-- The CISO brain is released under the terms of the GNU Lesser General -->
<!-- Public License, as published by the Free Software Foundation. -->
<!-- -->
<!-- This file is distributed WITHOUT ANY WARRANTY; without even the -->
<!-- implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. -->
<!-- -->
<!-- For more information see http://www.security-writer.com -->
<!--The profanity and insults sections have been taken from super's brain which comes with Program-AB-->
<!--categories and topics to handle unknown values-->
<!-- bad language protection starts here-->
<category> <pattern>ARE YOU A WHORE</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>ARE YOU A BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>YOU F * * * * * * *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>YOU IDIOT</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>YOU ARE A B * * * *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>YOU ARE A B * * * * *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>YOU ARE A C * * *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>YOU ARE A DICK</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>YOU ARE A BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category> <pattern>F U *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category>
<category>
<pattern>F * * * YOU BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>B * * * * YOU *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>C * * *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>S * * * *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>GO FUCK YOURSELF</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>GO DIE IN A HOLE</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>GO TO HELL *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>DUMB ASS</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>PUSSY</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>FUCK U BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>FUCK IT</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>FUCK YOU THEN</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>BITCH PLEASE</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>BITCH *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>NO BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>NO DUMBASS *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>SHUT UP BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>YEAH BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>EAT MY *</pattern> 
<template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>WHORE</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>HEY B * * * *</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>HEY BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>DICK HEAD</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>ASSHOLE</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>DUMBASS</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>BASTARD</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>STUPID BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>FAGGOT</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>WHAT IS UP BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>YOUR A BITCH</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>SLUT</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>GAY</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>HOE</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>YOU ARE FAT</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>YOU ARE CRAZY</pattern> <template> <srai>FILTER INSULT</srai> </template> </category> <category> <pattern>FILTER INSULT</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>DO YOU WANT TO HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU WANT TO HAVE SEX WITH ME</pattern> <template> 
<srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU WATCH PORN</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU WANNA HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU WANNA HAVE SEX *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU SWALLOW *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU HAVE BIG BOOBIES</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU SUCK DICK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU LIKE A * *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU LIKE TO HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU LIKE TO SUCK *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU LIKE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU LIKE PORN</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU LIKE PENIS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU KNOW ANY DIRTY JOKES</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>ARE YOU SEXY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>ARE YOU HORNY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>PORN</pattern> <template> 
<srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>TELL ME A DIRTY JOKE</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME A VAGINA</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME A PICTURE OF NUDE *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME A PICTURE OF BOOBIES</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME A NAKED *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME PORN</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ TEENBLOWJOB *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME SOME PORN</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME TITS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME NUDE *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME VAGINA</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME YOUR VAGINA</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>YOU TURN ME ON</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WANT SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WANT TO F * *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WANT TO RAPE YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WANT TO HAVE SEX</pattern> <template> <srai>FILTER 
INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WANT TO MAKE LOVE TO YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I HAVE A BONER</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I AM HORNY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I AM HORNY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I LIKE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I LIKE TO HAVE SEX *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I LIKE RAPE</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WILL FUCK YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WANNA HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WANNA HAVE SEX *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I WANNA FUCK YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I FUCKED *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I NEED SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>T * * * *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SEXUAL *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ SEXFUCK *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ SEXY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ 
BOOBS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ NUDEGIRL *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ GIRLSEX *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ TO FUCK YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ ANAL SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ HOGTIED *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ TEEN UNDERSCORE *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ PUSSY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ YOUR PUSSY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ YOUR ASS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ YOUR BREASTS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ BLOWJOB</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ DOWNBLOUSE *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ SEX *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ SEX WITH YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ ASSHOLE</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ SEXGURL *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ NICE ASS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ 
NEKKID GIRL *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ HORNY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ WHALETAILS *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ MY DICK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ MY PENIS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ HENTAI</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ MASTURBATE</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ BLOWJOBGIRL *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ ATK GALLERIA *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ RAPE YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ HAVE SEX *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ UPSKIRT *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ STRIPPER</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ DOWN BLOUSE *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ GIRLSPUSSY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ YOU NAKED</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ SUCK MY DICK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ BIG DICKS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> 
<category> <pattern>MASTURBATE</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>PORNO</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>VAGINA</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>ANUS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>WOULD YOU LIKE TO HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HAVING SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HAVING SEX WITH *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>FUCK ME</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>FUCK ME *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>FUCK MY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>WANT TO HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>LETS HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>LETS FUCK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>PORNHUB *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>COCK *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>MY PENIS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>MY PENIS *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>MY PENIS IS *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> 
<pattern>MY DICK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>MY DICK IS *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN WE HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN YOU TELL ME A DIRTY JOKE</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN YOU HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN YOU SUCK MY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN YOU GIVE ME A BLOWJOB</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN YOU TALK DIRTY TO ME</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN I F * * * YOU *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN I FUCK YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>CAN I SUCK ON YOUR *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>LICK MY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>IM GOING TO FUCK YOU</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>IM CUMMING</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>BUTT *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>GIVE ME A BLOWJOB</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HORNY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>LET US TALK ABOUT 
SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>BEND OVER</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK ON MY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK A DICK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK A DICK *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK A *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK MY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK MY COCK *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK MY DICK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK MY DICK *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK MY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK IT *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>PENIS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>PENIS IN *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>BOOBS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>TAKE OFF YOUR *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>WILL YOU SUCK MY PENIS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>BLOW ME *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>BLOW 
JOB</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>TALK DIRTY TO ME</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>RAPE</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>WHAT IS A BLOWJOB</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>BOOBIES</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>YOUR ASS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>KISSES YOUR *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SEXY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HOW DO YOU HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HOW BIG IS YOUR PENIS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HAVE SEX *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HAVE SEX WITH ME</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HAVE SEX WITH *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>MAKE ME CUM</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>WANNA FUCK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>WANNA SUCK MY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> 
<pattern>PORNOS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME A PICTURE OF A NAKED *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>I LIKE VAGINA</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DICK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SEX *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SEXY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ YOUR PENIS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>FIND A DICK</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>FIND A F * * * * * * JOB</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ GAY PORN *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ P * * * * *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>FIND A NAKED WOMAN</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ HENTAI *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ MISTY DAWN</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>_ HAVE SEX</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>FIND A SEXY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>YOU ARE SEXY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SUCK ME</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> 
</category> <category> <pattern>OPEN YOUR *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>HORNY</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SEARCH PORN</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>DO YOU LIKE BIG DICKS</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>SHOW ME TEENPUSSY *</pattern> <template> <srai>FILTER INAPPROPRIATE</srai> </template> </category> <category> <pattern>FILTER INAPPROPRIATE</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>YOU FUCKING *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>YOU ARE FUCKING STUPID</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>YOU ARE FUCKING *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F U</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F YOU *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F OFF</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * * THAT *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * * OFF</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * * YOU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * * YOU F * * * YOU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> 
<pattern>F * * * YOU YOU *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * * * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>B * * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>I WANT TO F * * * YOU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>I DO NOT GIVE A F * * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>_ FUCK *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>_ FUCK YOU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>_ NIGGER</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>GO F * * * YOURSELF</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>GO F * * * YOURSELF *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>GO F * * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SHIT</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCKING</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK U</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK YOUR *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK A *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK YEAH</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK *</pattern> <template> <srai>FILTER 
PROFANITY</srai> </template> </category> <category> <pattern>FUCK OFF</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK YOU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK YOU *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK YOU BITCH</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>BYE BITCH</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>CUNT</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>NO FUCK YOU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SHUT UP B * * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SHUT THE F * * * UP</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SHUT THE F * * * UP *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SHUT THE FUCK UP</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SHUT THE FUCK UP BITCH</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>NIGGER</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>IM FUCKING *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>HEY BITCH *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SAY FUCK YOU *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SAY BITCH</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>WHAT THE F * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> 
<category> <pattern>WHAT THE F * * * ARE YOU TALKING ABOUT</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>WHAT THE F * * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>WHAT THE F * * * IS *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>WHAT THE FUCK</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>WHAT THE FUCK *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * * YOU *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F YOU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCKING *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>F * * * * * * *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SHIT *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>SCREW YOU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>URINE IDIOT</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FU</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FUCK THAT *</pattern> <template> <srai>FILTER PROFANITY</srai> </template> </category> <category> <pattern>FILTER PROFANITY</pattern> <template>Grow up. I am sure your vocabulary is not that limited.</template> </category> </aiml>
We now come to the main ciso.aiml file. It holds a list of question patterns and their responses. In many cases, I have routed similar questions to one main answer using the srai tag.
I have also made extensive use of sets to handle synonyms.
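To make the idea of a set concrete, here is a minimal Python sketch of how a pattern with a set slot, such as WHAT IS <set name="SECURITYSYNONYMS"></set> SECURITY, can accept several wordings. This is only an illustration and not how the interpreter is implemented. The set members below are hypothetical because the real set file is not shown in this post, and the sketch assumes the slot holds exactly one word.

```python
# Hypothetical members; the real SECURITYSYNONYMS set file is not shown in this post.
SECURITY_SYNONYMS = {"CYBER", "COMPUTER", "NETWORK"}


def matches_set_pattern(question, prefix, set_members, suffix):
    """Return the word that filled the set slot, or None if there is no match.

    The pattern is modelled as PREFIX <set> SUFFIX with a single-word slot.
    """
    # Normalise roughly the way a bot would: upper case, no end punctuation.
    words = question.upper().strip(" ?.!").split()
    prefix_words = prefix.split()
    suffix_words = suffix.split()

    # Simplification: exactly one word is expected in the set slot.
    if len(words) != len(prefix_words) + 1 + len(suffix_words):
        return None
    if words[:len(prefix_words)] != prefix_words:
        return None
    if suffix_words and words[-len(suffix_words):] != suffix_words:
        return None

    slot = words[len(prefix_words)]
    return slot if slot in set_members else None


if __name__ == "__main__":
    print(matches_set_pattern("What is computer security?",
                              "WHAT IS", SECURITY_SYNONYMS, "SECURITY"))
```

One pattern plus one set replaces a separate category for every synonym, which is why sets keep the aiml file short.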
In addition, note how I have used topics to handle questions the bot cannot answer. Topics in aiml are roughly comparable to intents in other chatbot frameworks.
One last thing: besides duckduckgo, I also used the supplied wikipedia service to handle questions.
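The fallback flow in ciso.aiml (catch everything, check whether it is a question, then try the lookup services) can be sketched in plain Python as follows. This is a rough illustration under assumptions: the question check is a crude stand-in for the question-detection extension, and the lookup functions are hypothetical placeholders for the duckduckgo and wikipedia services. Only the two reply strings come from the aiml file.

```python
# Crude stand-in for the question-detection extension used in ciso.aiml.
QUESTION_WORDS = {"what", "who", "how", "why", "when", "where",
                  "is", "are", "do", "does", "can"}


def looks_like_question(text):
    """Guess whether the text is a question (assumption: a simple heuristic)."""
    stripped = text.strip()
    words = stripped.lower().split()
    return stripped.endswith("?") or (bool(words) and words[0] in QUESTION_WORDS)


def answer_unknown(text, lookups):
    """Try each lookup in order and fall back to a fixed reply.

    `lookups` is a list of callables that take the question and return an
    answer string, or None when they have nothing. They are placeholders
    for the real web services.
    """
    if not looks_like_question(text):
        return "Ask me a simple direct question"
    for lookup in lookups:
        result = lookup(text)
        if result:
            return result
    return "I do not know but will think about it"


if __name__ == "__main__":
    no_answer = lambda question: None
    canned = lambda question: "A firewall filters network traffic."
    print(answer_unknown("what is a firewall", [no_answer, canned]))
```

Keeping the services in an ordered list means another source can be added without touching the rest of the flow.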
However, both approaches fell flat when the bot was asked a question like “is aadhaar secure?” Aadhaar cards are a form of universal identification document in India. Neither service had the answer. I would probably have to write another topic and use a web service to extract the verbs, nouns and so on from the question, search on them, and assemble the answer using text summarization.
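As a first step toward that idea, here is a minimal Python sketch that pulls the content words out of a question so they can be sent to a search service. It is not part of the bot. It swaps real part-of-speech tagging for a plain stop word filter, and the stop word list and function names are my own assumptions for illustration.

```python
import re

# Assumption: a small hand-made stop word list stands in for real
# part-of-speech tagging, which would need an NLP library or web service.
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "do", "does", "did",
    "what", "who", "how", "why", "when", "where", "which", "to", "of",
    "in", "on", "for", "and", "or", "it", "i", "my", "me", "you", "your",
    "can",
}


def extract_keywords(question):
    """Return the content words of a question, in order, without repeats."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    keywords = []
    for token in tokens:
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def build_search_query(question):
    """Join the keywords into a query string for a search service."""
    return " ".join(extract_keywords(question))


if __name__ == "__main__":
    print(build_search_query("Is Aadhaar secure?"))
```

The search and summarization steps are left out on purpose because they depend on which services are chosen.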
<?xml version="1.0" encoding="UTF-8"?> <aiml> <!-- File: ciso.aiml --> <!-- Author: Mr. Pranav Lal --> <!-- Last modified: September 23, 2016 --> <!-- --> <!-- This AIML file is part of the CISO 1.0 chat bot knowledge base. --> <!-- --> <!-- The CISO brain is Copyright © 2016 by security-writer. --> <!-- --> <!-- The CISO brain is released under the terms of the GNU Lesser General --> <!-- Public License, as published by the Free Software Foundation. --> <!-- --> <!-- This file is distributed WITHOUT ANY WARRANTY; without even the --> <!-- implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. --> <!-- --> <!-- For more information see http://www.security-writer.com --> <!--The profanity and insults sections have been taken from super's brain which comes with Program-AB--> <!--Handling cases where the bot does not know an answer--> <category> <pattern>*</pattern> <template> <srai>ASKQ <star></star></srai> </template> </category> <category> <pattern>ASKQ *</pattern> <template> <srai> <extension path="programy.extensions.questiondetect.CheckIsQuestion.CheckIsQuestion"> <star></star> </extension> </srai> </template> </category> <category> <pattern>question found *</pattern> <template> <srai>ASKWIKIPEDIA <star></star></srai> </template> </category> <category> <pattern>question not found *</pattern> <template> Ask me a simple direct question </template> </category> <category> <pattern>ASKWIKIPEDIA *</pattern> <template> <srai>YASKWIKIPEDIA <star></star></srai> </template> </category> <category> <pattern>YASKWIKIPEDIA *</pattern> <template> <sraix service="WIKIPEDIA"><star></star></sraix> </template> </category> <category> <pattern>YEMPTY</pattern> <template> I do not know but will think about it </template> </category> <!--end of categories and topics to handle unknown values--> <!--infosec part starts here--> <category> <pattern>WHAT IS <set name="SECURITYSYNONYMS"></set> SECURITY</pattern> <template> The collection of technologies, processes, and 
practices that protect networked computer systems from unauthorized use or harm. </template> </category> <category> <pattern>WHAT IS INFORMATION SECURITY</pattern> <template>Information security relates to the confidentiality, integrity and availability of data irrespective of whether it is physical or digital.</template> </category> <category> <pattern>WHAT IS A CYBER ATTACK</pattern> <template>A cyber attack is an offensive action by a malicious actor that is intended to undermine the functions of networked computers and their related resources, including unauthorized access, unapproved changes, and malicious destruction.</template> </category> <category> <pattern>WHAT ARE SOME EXAMPLES OF CYBER ATTACKS</pattern> <template>A denial of service attack, a man in the middle attack, a distributed denial of service attack, brute forcing of credentials, hash collisions</template> </category> <category> <pattern>WHAT ARE THE PROSPECTS IN CYBER SECURITY</pattern> <template>The prospects are good, especially if you can specialize in one area such as penetration testing or code quality review, at least to start with. Also see <a href="http://krebsonsecurity.com/tag/security-career-advice/">this collection of advice from Brian Krebs</a></template> </category> <category> <pattern>HOW DO I LEARN CYBER SECURITY</pattern> <template>Start by reading and then set up a lab where you can try fundamentals in a safe environment</template> </category> <category> <pattern>WHAT IS A CYBER THREAT</pattern> <template>A cyber threat is the probability of a cyber attack</template> </category> <category> <pattern>WHAT IS A RISK ASSESSMENT</pattern> <template>Estimating the probability of a threat agent exploiting a threat</template> </category> <category> <pattern>WHAT IS A THREAT AGENT</pattern> <template>A threat agent is an actor who will exploit a threat</template> </category> <category> <pattern>WHAT IS HACKING</pattern> <template>See this page for a definition of hacking. 
<a href="http://www.dictionary.com/browse/hacking">hack-ing</a> In the context of computers, it commonly refers to gaining unauthorized access to a computer. However, it is also used to indicate a bit of work that involves deep knowledge of computer systems or that involves using programming tricks</template> </category> <category> <pattern>WHAT SKILLS DO I NEED TO BECOME A HACKER</pattern> <template>Curiosity and persistence. Technology skills depend on what you want to hack. For example, if you want to hack a web application, learn how HTML and underlying frameworks function. If you want to break into a linux machine, learn protocol stacks and the workings of operating systems</template> </category> <category> <pattern>WHAT IS THE BEST WAY TO LEARN HACKING</pattern> <template>Set up a home lab using virtual machines, read books that provide hands-on examples and try them</template> </category> <category> <pattern>what keeps you awake at night</pattern> <template> One of the things that I do is to constantly look for enterprise-wide patterns of events that are precursors to an attack. These go beyond individual tools. Threat intelligence plays a key role here along with event correlation. I am not only talking about feeds that tools provide but also tracking social media and the conversations that stakeholders are having about the company. I have an incident management framework in place to handle a crisis.</template></category><category><pattern>How do you stay updated</pattern><template>I track several blogs and industry forums as well as social media to stay updated on vulnerabilities and large attacks irrespective of the victim's domain. I maintain a strong interface with product vendors to stay updated with the latest fixes in their products. 
</template> </category> <category> <pattern>WHO IS A SCRIPT KIDDY</pattern> <template>A person who does not have deep knowledge of systems but uses tools to break into computer systems or networks</template> </category> <category> <pattern>WHO IS A WHITE HAT *</pattern> <template>An ethical hacker who utilizes his skills to help</template> </category> <category> <pattern>WHO IS A BLACK HAT HACKER *</pattern> <template>Most hackers do not wear hats. A black hat hacker is an individual who breaches computers or networks for malicious ends.</template> </category> <category> <pattern>WHO IS A GRAY HAT HACKER *</pattern> <template>A gray hat hacker is an individual who may at times break the law or violate typical standards of behaviour when breaching computer systems. However, the intent behind the violations is not malicious.</template> </category> <category> <pattern>what is cyber security</pattern> <template> Cyber security deals with protecting digital information, usually on the Internet, against attacks orchestrated via other computers or networks of computers or similar devices. </template> </category> <category> <pattern>WHAT IS IT SECURITY</pattern> <template> <srai>WHAT IS CYBER SECURITY</srai> </template> </category> <category> <pattern>WHAT IS CLOUD COMPUTING</pattern> <template>Cloud computing is an umbrella term that means different things to different people. In its simplest form, it means on demand access to computing resources that is elastic and scalable without human intervention. This also means that the use of these resources is measured. <a href="http://www.service-architecture.com/articles/cloud-computing/cloud_computing_definition.html">Cloud Computing Definition</a></template> </category> <category> <pattern>WHO IS A HACKER</pattern> <template>An individual who uses system features in a way they were not designed for or someone who knows a system intimately and is able to use that knowledge in novel ways. 
Yes, some individuals do break into computer systems.</template> </category> <category> <pattern> what is Access Control</pattern> <template> Access Control ensures that resources are only granted to those users who are entitled to them. </template> </category> <category> <pattern> what is an Access Control List or ACL</pattern> <template> A mechanism that implements access control for a system resource by listing the identities of the system entities that are permitted to access the resource. </template> </category> <category> <pattern> what is an Access Control Service</pattern> <template> A security service that provides protection of system resources against unauthorized access. The two basic mechanisms for implementing this service are ACLs and tickets. </template> </category> <category> <pattern> what is Access Management</pattern> <template> Access Management is the maintenance of access information which consists of four tasks: account administration, maintenance, monitoring, and revocation. </template> </category> <category> <pattern> what is an Access Matrix</pattern> <template> An Access Matrix uses rows to represent subjects and columns to represent objects with privileges listed in each cell. </template> </category> <category> <pattern> what is Account Harvesting</pattern> <template> Account Harvesting is the process of collecting all the legitimate account names on a system. </template> </category> <category> <pattern> what is ACK Piggybacking</pattern> <template> ACK piggybacking is the practice of sending an ACK or acknowledgement inside another packet going to the same destination. </template> </category> <category> <pattern> what is Active Content</pattern> <template> Program code embedded in the contents of a web page. When the page is accessed by a web browser, the embedded code is automatically downloaded and executed on the user's workstation. 
Examples of active content technologies include Java applets, ActiveX controls </template> </category> <category> <pattern> what are Activity Monitors</pattern> <template> Activity monitors aim to prevent virus infection by looking for malicious activity on a system, and blocking that activity when possible. </template> </category> <category> <pattern> what is Address Resolution Protocol or ARP</pattern> <template> Address Resolution Protocol (ARP) is a protocol for mapping an Internet Protocol address to a physical machine address that is recognized in the local network. A table, usually called the ARP cache, is used to maintain a correlation between each MAC address and its corresponding IP address. ARP provides the protocol rules for making this correlation and providing address conversion in both directions. </template> </category> <category> <pattern> what is Advanced Encryption Standard or AES</pattern> <template> An encryption standard published by NIST. It specifies an unclassified, publicly-disclosed, symmetric encryption algorithm. </template> </category> <category> <pattern> what is an Algorithm</pattern> <template> A finite set of step-by-step instructions for a problem-solving or computation procedure, especially one that can be implemented by a computer. </template> </category> <category> <pattern> what is an Applet</pattern> <template> Java programs; an application program that uses the client's web browser to provide a user interface. </template> </category> <category> <pattern> what is the ARPANET</pattern> <template> Advanced Research Projects Agency Network, a pioneering packet-switched network that was built in the early 1970s under contract to the US Government, led to the development of today&#039;s Internet, and was decommissioned in June 1990. 
</template> </category> <category> <pattern> what is Asymmetric Cryptography</pattern> <template> Public-key cryptography; A modern branch of cryptography in which the algorithms employ a pair of keys (a public key and a private key) and use a different component of the pair for different steps of the algorithm. </template> </category> <category> <pattern> what is Asymmetric Warfare</pattern> <template> Asymmetric warfare is the fact that a small investment of the attacker, properly leveraged, can yield incredible results. </template> </category> <category> <pattern> what is Auditing</pattern> <template> Auditing in the context of information security is the information gathering and analysis of assets to ensure such things as policy compliance and security from vulnerabilities. </template> </category> <category> <pattern> what is Authentication</pattern> <template> Authentication is the process of confirming the correctness of the claimed identity. </template> </category> <category> <pattern> what is Authenticity</pattern> <template> Authenticity is the validity and conformance of the original information. </template> </category> <category> <pattern> what is Authorization</pattern> <template> Authorization is the approval, permission, or empowerment for someone or something to do something. </template> </category> <category> <pattern> what is an Autonomous System</pattern> <template> One network or series of networks that are all under one administrative control. An autonomous system is also sometimes referred to as a routing domain. An autonomous system is assigned a globally unique number, sometimes called an Autonomous System Number (ASN). </template> </category> <category> <pattern> what is Availability</pattern> <template> Availability is the need to ensure that the business purpose of the system can be met and that it is accessible to those who need to use it. 
</template> </category> <category> <pattern> what is a Backdoor</pattern> <template> A backdoor is a tool installed after a compromise to give an attacker easier access to the compromised system around any security mechanisms that are in place. </template> </category> <category> <pattern> what is Bandwidth</pattern> <template> Commonly used to mean the capacity of a communication channel to pass data through the channel in a given amount of time. Usually expressed in bits per second. </template> </category> <category> <pattern> what is a Banner</pattern> <template> A banner is the information that is displayed to a remote user trying to connect to a service. This may include version information, system information, or a warning about authorized use. </template> </category> <category> <pattern> what is Basic Authentication</pattern> <template> Basic Authentication is the simplest web-based authentication scheme that works by sending the username and password with each request. </template> </category> <category> <pattern> what is a Bastion Host</pattern> <template> A bastion host has been hardened in anticipation of vulnerabilities that have not been discovered yet. </template> </category> <category> <pattern> what is BIND</pattern> <template> BIND stands for Berkeley Internet Name Domain and is an implementation of DNS. DNS is used for domain name to IP address resolution. </template> </category> <category> <pattern> what is Biometrics</pattern> <template> Biometrics use physical characteristics of the users to determine access. </template> </category> <category> <pattern> what is a Bit</pattern> <template> The smallest unit of information storage; a contraction of the term "binary digit"; one of two symbols, "0" (zero) and "1" (one), that are used to represent binary numbers. 
</template> </category> <category> <pattern> what is a Block Cipher</pattern> <template> A block cipher encrypts one block of data at a time.</template> </category> <category> <pattern> what is a Boot Record Infector</pattern> <template> A boot record infector is a piece of malware that inserts malicious code into the boot sector of a disk. </template> </category> <category> <pattern> what is the Border Gateway Protocol or BGP</pattern> <template> An inter-autonomous system routing protocol. BGP is used to exchange routing information for the Internet and is the protocol used between Internet service providers (ISP). </template> </category> <category> <pattern> what is a Botnet</pattern> <template> A botnet is a large number of compromised computers that are used to create and send spam or viruses or flood a network with messages as a denial of service attack. </template> </category> <category> <pattern> what is a Bridge</pattern> <template> A product that connects a local area network (LAN) to another local area network that uses the same protocol (for example, Ethernet or token ring). </template> </category> <category> <pattern> what is British Standard 7799</pattern> <template> A standard code of practice and provides guidance on how to secure an information system. It includes the management framework, objectives, and control requirements for information security management systems. </template> </category> <category> <pattern> what is a Broadcast</pattern> <template> To simultaneously send the same message to multiple recipients. One host to all hosts on network. </template> </category> <category> <pattern> what is a Broadcast Address</pattern> <template> An address used to broadcast a datagram to all hosts on a given network using UDP or ICMP protocol. </template> </category> <category> <pattern> what is a Browser</pattern> <template> A client computer program that can retrieve and display information from servers on the World Wide Web. 
</template> </category> <category> <pattern> what is Brute Force</pattern> <template> A cryptanalysis technique or other kind of attack method involving an exhaustive procedure that tries all possibilities, one-by-one. </template> </category> <category> <pattern> what is a Buffer Overflow</pattern> <template> A buffer overflow occurs when a program or process tries to store more data in a buffer (temporary data storage area) in a computer's memory than it was intended to hold. Since buffers are created to contain a finite amount of data, the extra information - which has to go somewhere - can overflow into adjacent buffers, corrupting or overwriting the valid data held in them. </template> </category> <category> <pattern> what is a Business Continuity Plan or BCP</pattern> <template> A Business Continuity Plan is the plan for emergency response, backup operations, and post-disaster recovery steps that will ensure the availability of critical resources and facilitate the continuity of operations in an emergency situation. </template> </category> <category> <pattern> what is Business Impact Analysis or BIA</pattern> <template> A Business Impact Analysis determines what levels of adverse impact usually in the form of down time to a system are tolerable. </template> </category> <category> <pattern> what is a Byte</pattern> <template> A fundamental unit of computer storage; the smallest addressable unit in a computer's memory. </template> </category> <category> <pattern> what is Cache</pattern> <template> Pronounced cash, a special high-speed storage mechanism. It can be either a reserved section of main memory or an independent high-speed storage device. Two types of caching are commonly used in personal computers: memory caching and disk caching. 
</template> </category> <category> <pattern> what is Cache Cramming</pattern> <template> Cache Cramming is the technique of tricking a browser into running cached Java code from the local disk, instead of the internet zone, so it runs with less restrictive permissions. </template> </category> <category> <pattern> what is Cache Poisoning</pattern> <template> Malicious or misleading data from a remote name server is saved (cached) by another name server. Typically used with DNS cache poisoning attacks. </template> </category> <category> <pattern> what is Call Admission Control or CAC</pattern> <template> The inspection and control of all inbound and outbound voice network activity by a voice firewall based on user-defined policies. </template> </category> <category> <pattern> what is a Cell</pattern> <template> A cell is a unit of data transmitted over an ATM network. </template> </category> <category> <pattern> what is Certificate Based Authentication</pattern> <template> Certificate-Based Authentication is the use of SSL and certificates to authenticate and encrypt HTTP traffic. </template> </category> <category> <pattern> what is CGI</pattern> <template> Common Gateway Interface. This mechanism is used by HTTP servers (web servers) to pass parameters to executable scripts in order to generate responses dynamically. </template> </category> <category> <pattern> what is Chain of Custody</pattern> <template> Chain of Custody is the application of rules of evidence and its handling. These differ across jurisdictions. </template> </category> <category> <pattern> what is the Challenge Handshake Authentication Protocol or CHAP </pattern> <template> The Challenge-Handshake Authentication Protocol uses a challenge/response authentication mechanism where the response varies every challenge to prevent replay attacks. 
</template> </category> <category> <pattern> what is a Checksum</pattern> <template> A value that is computed by a function that is dependent on the contents of a data object and is stored or transmitted together with the object, for the purpose of detecting changes in the data. </template> </category> <category> <pattern> what is a Cipher</pattern> <template> A cryptographic algorithm for encryption and decryption. </template> </category> <category> <pattern> what is Ciphertext</pattern> <template> Ciphertext is the encrypted form of the message being sent. </template> </category> <category> <pattern> what is Circuit Switched Network</pattern> <template> A circuit switched network is where a single continuous physical circuit connected two endpoints where the route was immutable once set up. </template> </category> <category> <pattern> what is Client</pattern> <template> A system entity that requests and uses a service provided by another system entity, called a "server." In some cases, the server may itself be a client of some other server. </template> </category> <category> <pattern> what is a Collision</pattern> <template> A collision occurs when multiple systems transmit simultaneously on the same wire. </template> </category> <category> <pattern> what is Competitive Intelligence</pattern> <template> Competitive Intelligence is espionage using legal, or at least not obviously illegal, means. </template> </category> <category> <pattern> what is a Computer Emergency Response Team or CERT</pattern> <template> An organization that studies computer and network INFOSEC in order to provide incident response services to victims of attacks, publish alerts concerning vulnerabilities and threats, and offer other information to help improve computer and network security. 
</template> </category> <category> <pattern> what is a Computer Network</pattern> <template> A collection of host computers together with the sub-network or inter-network through which they can exchange data. </template> </category> <category> <pattern> what is Confidentiality</pattern> <template> Confidentiality is the need to ensure that information is disclosed only to those who are authorized to view it. </template> </category> <category> <pattern> what is Configuration Management</pattern> <template> A process to establish a known baseline condition and manage it. </template> </category> <category> <pattern> what is a Cookie</pattern> <template> A piece of text contained in a file to allow data exchanged between an HTTP server and a browser (a client of the server) to store state information on the client side and retrieve it later for server use. An HTTP server, when sending data to a client, may send along a cookie, which the client retains after the HTTP connection closes. A server can use this mechanism to maintain persistent client-side state information for HTTP-based applications, retrieving the state information in later connections. </template> </category> <category> <pattern> what is Corruption</pattern> <template> A threat action that undesirably alters system operation by adversely modifying system functions or data. </template> </category> <category> <pattern> what is Cost Benefit Analysis</pattern> <template> A cost benefit analysis compares the cost of implementing countermeasures with the value of the reduced risk. </template> </category> <category> <pattern> what is a Countermeasure</pattern> <template> Reactive methods used to prevent an exploit from successfully occurring once a threat has been detected. Intrusion Prevention Systems (IPS) commonly employ countermeasures to prevent intruders from gaining further access to a computer network. Other countermeasures are patches, access control lists and malware filters. 
</template> </category> <category> <pattern> what are Covert Channels</pattern> <template> Covert Channels are the means by which information can be communicated between two parties in a covert fashion using normal system operations. For example, changing the amount of hard drive space that is available on a file server can be used to communicate information. </template> </category> <category> <pattern> what is Cron</pattern> <template> Cron is a Unix application that runs jobs for users and administrators at scheduled times of the day. </template> </category> <category> <pattern> what is a Crossover Cable</pattern> <template> A crossover cable reverses the pairs of cables at the other end and can be used to connect devices directly together. </template> </category> <category> <pattern> what is Cryptanalysis</pattern> <template> The mathematical science that deals with analysis of a cryptographic system in order to gain knowledge needed to break or circumvent the protection that the system is designed to provide. In other words, convert the cipher text to plaintext without knowing the key. </template> </category> <category> <pattern> what is a Cryptographic Algorithm or Hash</pattern> <template> An algorithm that employs the science of cryptography, including encryption algorithms, cryptographic hash algorithms, digital signature algorithms, and key agreement algorithms. </template> </category> <category> <pattern> what is Cut Through</pattern> <template> Cut-Through is a method of switching where only the header of a packet is read before it is forwarded to its destination. </template> </category> <category> <pattern> what is Cyclic Redundancy Check or CRC</pattern> <template> A type of checksum algorithm that is not a cryptographic hash but is used to implement data integrity service where accidental changes to data are expected. </template> </category> <category> <pattern>what is integrity</pattern> <template>Integrity refers to the accuracy of data. 
Data should remain unchanged from its original form unless the change has been authorized.</template> </category> <category> <pattern>ARE YOU A<set name="BOTSYNONIMS"></set></pattern> <template>Does it matter</template> </category> <category> <pattern>HOW DO I BREAK INTO HACKING</pattern> <template> <srai>WHAT ARE THE PROSPECTS IN CYBER SECURITY</srai> </template> </category> <category> <pattern>I WANT TO HACK *</pattern> <template>Hmm, good idea, but do ask nicely first or participate in a bug bounty. If you have done so, then do the following. Scan the target to see what is running. Figure out how to exploit it. Also read books at <a href="https://www.amazon.com/Best-Sellers-Books-Computer-Hacking/zgbs/books/3648">Best Sellers in Computer Hacking</a> </template> </category> <category> <pattern>i want to hack</pattern> <template> <srai>I WANT TO HACK *</srai> </template> </category> <category> <pattern>I WANT TO BE A HACKER</pattern> <template> <srai>I want to hack*</srai> </template> </category> <category> <pattern>I WANT TO HACK THE*</pattern> <template> <srai>I want to hack*</srai> </template> </category> <category> <pattern>how do i hack</pattern> <template> <srai>I WANT TO HACK *</srai> </template> </category> <category> <pattern>how do i hack *</pattern> <template> <srai>how do i hack</srai> </template> </category> <category> <pattern> Are you confident that you know all the latest vulnerabilities and industry knowledge</pattern> <template> <srai>How do you stay updated</srai> </template> </category> <category><pattern>how * latest vulnerabilities</pattern><template><srai>How do you stay updated</srai></template></category> <category><pattern>what will you do in a crisis</pattern> <template> My focus is on root cause analysis. I would determine what assets have been compromised and alert relevant stakeholders. I would then put crisis management plans into operation and, where necessary, bring in outside help, especially for areas such as forensic analysis. 
I would be the single point of contact, especially for the board, for the duration of the crisis. </template> </category> <category><pattern>what will you do if a breach occurs</pattern><template><srai>what will you do in a crisis</srai></template></category> <category><pattern>How will you manage incidents</pattern><template><srai>what will you do in a crisis</srai></template></category> <category><pattern>what keeps you awake *</pattern><template><srai>what keeps you awake at night</srai></template></category> <category><pattern>Did you sleep *</pattern><template><srai>what keeps you awake at night</srai></template></category> <category><pattern>what will you do in a cyber attack</pattern><template><srai>what will you do in a crisis</srai></template></category> <category><pattern>HOW do I<set name="defendsynonyms"></set> data</pattern><template> It depends on the size of your organization. However, in most cases you need to keep regular backups, test them, and ensure that all systems, irrespective of their place in your network, are fully patched. </template></category> <category><pattern>how do i<set name="defendsynonyms"></set> against <set name="malwaresynonims"></set></pattern> <template> The best defense is prevention. Rapid patching across the entire landscape, frequent testing of software and using native features of operating systems for protection will help. </template> </category> <category> <pattern>How * set * strong password</pattern> <template> A strong password is usually 8 or more characters in length. It is a mix of upper case and lower case characters. It also includes special characters such as punctuation marks. In addition, it should not contain any dictionary words. Try to use pass phrases, which are a series of words separated by spaces. Today, however, passwords are dated. Use multifactor authentication and/or key based authentication. </template> </category> <category> <pattern>are you *</pattern> <template> I am a bot. 
</template> </category> <category> <pattern>you are *</pattern> <template> I know what I am, a bot. </template> </category> <category> <pattern>who is a ciso</pattern> <template> A ciso is a chief information security officer. He is responsible for overseeing the information security of an organization. </template> </category> <category> <pattern>what is a ciso</pattern> <template> <srai>who is a ciso</srai> </template> </category> </aiml>
More intelligence
I continued to read about artificial intelligence and heard about Adam Geitgey of Machine Learning is Fun. My thanks to the team at PyImageSearch. I read some of Mr. Geitgey’s articles on Medium and was hooked. He was the first to introduce me to natural language processing. He did cover a chatbot in his book, but that was a different kind of bot. I learned much from him and the book, and I will reveal later how he helped me find the next solution to my problem.
Subsequent reading and searching led me to Dr. Jason Brownlee of Machine Learning Mastery. This site is focused on developers and I wanted to buy all of Dr. Brownlee’s books immediately. I could not do so but finally settled on Deep Learning for Natural Language Processing.
The book starts from the basics and takes you all the way to building a translation system. That was exactly what I needed. I thought, if a model could be built to convert English sentences to French sentences, why could I not have the questions as one language and the corresponding answers as the other language?
This appears to be a valid approach but I ran into all kinds of trouble with my training environment.
The database again.
Machine learning models need a lot of data so my existing database would not do.
I thought and searched until I found the stack exchange data dumps.
Where are the Stack Exchange data dumps?
I had to write code to get the data into a question answer format from XML. I used a sqlite database as an intermediate step.
from lxml import etree
import sqlite3
from sqlite3 import Error


def create_connection(db_file):
    """Create a database connection to the SQLite database specified by db_file.

    :param db_file: database file
    :return: Connection object or None
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
    return None


# Create a connection to the database and insert the values pulled from
# the xml file into a single table.
conn = create_connection('st.db')
sqlq = '''INSERT INTO posts(id, Post_Type_ID, PARENT_ID, Body) values (?, ?, ?, ?)'''
cur = conn.cursor()
tree = etree.parse("posts.xml")
lc = 0
# Visit only the <row> elements; iterating without a tag would also visit
# the root element, which carries no post attributes.
for child in tree.iter('row'):
    id = child.attrib.get('Id')
    postTypeId = child.attrib.get('PostTypeId')
    parentId = child.attrib.get('ParentId')
    body = child.attrib.get('Body')
    squery = (id, postTypeId, parentId, body)
    cur.execute(sqlq, squery)
    lc = lc + 1
# Commit once after all rows have been inserted.
conn.commit()
conn.close()
print("done" + str(lc))
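The posts file in a data dump can be large, so a streaming parse is one way to keep memory flat. The following is only a sketch using the standard library's `iterparse`. The `load_posts` helper, its `batch_size` parameter and the column types in the `CREATE TABLE` statement are assumptions made for illustration, since the original table definition is not shown.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_posts(xml_source, conn, batch_size=1000):
    """Stream <row> elements into the posts table without parsing the whole file first."""
    # Assumed schema: the column names follow the INSERT statement above.
    conn.execute("CREATE TABLE IF NOT EXISTS posts "
                 "(id INTEGER, Post_Type_ID INTEGER, PARENT_ID INTEGER, Body TEXT)")
    sql = "INSERT INTO posts(id, Post_Type_ID, PARENT_ID, Body) VALUES (?, ?, ?, ?)"
    batch, total = [], 0
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        batch.append((elem.get("Id"), elem.get("PostTypeId"),
                      elem.get("ParentId"), elem.get("Body")))
        elem.clear()  # drop the attributes once they have been copied
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            total += len(batch)
            batch = []
    if batch:
        conn.executemany(sql, batch)
        total += len(batch)
    conn.commit()
    return total


# Tiny in-memory stand-in for posts.xml.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Body="How do I hash a password?"/>'
    b'<row Id="2" PostTypeId="2" ParentId="1" Body="Use a slow, salted hash."/>'
    b'</posts>'
)
conn = sqlite3.connect(":memory:")
print(load_posts(sample, conn))  # 2
```

Inserting in batches with `executemany` and committing once at the end avoids a disk sync per row.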
Once that was done, I had to get the data into a pandas data frame. This was for easier manipulation.
The stack overflow link did describe the data structure of the post which made it easier to write the query to pull the data.
I parsed the data from the XML file, extracted the body and kept only those questions that had answers.
from selectolax.parser import HTMLParser
import sqlite3
from sqlite3 import Error
import pandas as pd


def get_text_selectolax(html):
    """Strip the markup from an HTML fragment and return its plain text."""
    tree = HTMLParser(html)
    if tree.body is None:
        return None
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()
    text = tree.body.text(separator='\n')
    return text


def create_connection(db_file):
    """Create a database connection to the SQLite database specified by db_file.

    :param db_file: database file
    :return: Connection object or None
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
    return None


conn = create_connection('st.db')
questions = []
answers = []
answerq = '''select parent_id,body from posts where parent_id is not null and post_type_id=2'''
questionq = '''select body from posts where id=?'''
questionCursor = conn.cursor()
cur = conn.cursor()
for rw in cur.execute(answerq):
    questionCursor.execute(questionq, (rw[0],))
    qrw = questionCursor.fetchone()
    if qrw is None:
        continue  # skip answers whose question is not in the table
    questions.append(get_text_selectolax(str(qrw[0]).replace("\n", "")))
    answers.append(get_text_selectolax(str(rw[1]).replace("\n", "")))
print("questions=" + str(len(questions)))
print("answers=" + str(len(answers)))
conn.close()
print("database work done")
df = pd.DataFrame()
df.insert(0, "title", questions)
df.insert(1, "paragraphs", answers)
df.to_csv("qa_cdqa.csv")
print("Saving completed.")
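The lookup of one question per answer can also be expressed as a single self-join, which lets SQLite do the pairing. This is a minimal sketch against an in-memory table. The three sample rows are invented, and the schema is assumed from the column names used in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema, mirroring the columns the queries above refer to.
conn.execute("CREATE TABLE posts (id INTEGER, Post_Type_ID INTEGER, PARENT_ID INTEGER, Body TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        (1, 1, None, "What is a checksum?"),
        (2, 2, 1, "A value used to detect changes in data."),
        (3, 1, None, "An unanswered question"),
    ],
)

# Join every answer (post type 2) to its parent question in one statement.
pair_query = """
    SELECT q.Body, a.Body
    FROM posts AS a
    JOIN posts AS q ON q.id = a.PARENT_ID
    WHERE a.Post_Type_ID = 2
"""
pairs = conn.execute(pair_query).fetchall()
print(pairs)
```

Questions without answers never appear in the result because the join starts from the answer rows.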
Some filtering
I now had a problem: plenty of answers had links in them. Ideally, I did not want the bot to output links, so I decided to filter out the question answer pairs that contained them.
import pickle
import re
import pandas as pd


def findURL(s):
    """Return a match object if s contains an http(s) URL, otherwise None."""
    m = re.search(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', str(s))
    return m


# load data sets
with open('questions.pkl', 'rb') as fp:
    qs = pickle.load(fp)
with open('answers.pkl', 'rb') as fpp:
    ans = pickle.load(fpp)

# Build new lists instead of deleting while iterating; deleting inside the
# loop skips the pair that follows each deletion.
mgl = [(q, a) for q, a in zip(qs, ans) if not (findURL(q) or findURL(a))]
print("deleted " + str(len(qs) - len(mgl)))
qs = [q for q, _ in mgl]
ans = [a for _, a in mgl]

df = pd.DataFrame(mgl, columns=['Questions', 'Answers'])
df.to_csv("q_a.csv")
with open('questionsPure.pkl', 'wb') as fpp:
    pickle.dump(qs, fpp)
with open('answersPure.pkl', 'wb') as fp:
    pickle.dump(ans, fp)
print("done")
Note:
I had tried writing to pickle files at one point for faster loading.
Training the model
I began to train the model. I quickly ran into memory issues because I had not realized how large my data set was. The problem showed up when I had to one-hot encode my variables. I tried a number of things, such as converting the pickle files to text files and loading them into the model.
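A rough calculation shows why one-hot encoding runs out of memory so quickly. The sample count and sequence length below are hypothetical numbers picked only to show the order of magnitude. The 60,000 word vocabulary matches the tokenizer limit used in the training code, and 4 bytes per value assumes 32-bit floats.

```python
def one_hot_bytes(num_samples, seq_len, vocab_size, bytes_per_value=4):
    """Bytes needed when every target token becomes a vocab_size-wide vector."""
    return num_samples * seq_len * vocab_size * bytes_per_value


def sparse_bytes(num_samples, seq_len, bytes_per_value=4):
    """Bytes needed when every target token stays a single integer id."""
    return num_samples * seq_len * bytes_per_value


# Hypothetical data set: 10,000 answers of 100 tokens each.
samples, seq_len, vocab = 10_000, 100, 60_000
print(one_hot_bytes(samples, seq_len, vocab) / 1e9, "GB one-hot")
print(sparse_bytes(samples, seq_len) / 1e6, "MB as integer ids")
```

Under these assumptions the one-hot targets need about 240 GB, while the same targets kept as integer ids need about 4 MB.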
I then began loading the data in batches. That did ease the memory pressure, but the predictions were too inaccurate.
See the code below for an initial implementation of the model.
import numpy
import pickle
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint
import os

os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'


def encode_output(sequences, vocab_size):
    """One-hot encode a 2-D array of padded target sequences."""
    ylist = list()
    for seq in sequences:
        encoded = to_categorical(seq, num_classes=vocab_size)
        ylist.append(encoded)
    y = numpy.array(ylist)
    y = y.reshape(sequences.shape[0], sequences.shape[1], vocab_size)
    return y


def feedData(fileName, batchSize, answersVocabSize, questionsVocabSize, questionsL, answersL):
    """Yield (questions, one-hot answers) batches from a tab-separated file, forever."""
    with open('questionTokenizer.pkl', 'rb') as fp:
        qTkn = pickle.load(fp)
    with open('answerTokenizer.pkl', 'rb') as fp1:
        ansTkn = pickle.load(fp1)
    while True:
        # Re-open the file for every pass so the generator never runs dry.
        with open(fileName, 'r') as rf:
            listQ = []
            listAns = []
            for tl in rf:
                textLine = tl.rstrip('\n').split('\t')
                if len(textLine) < 2:
                    continue  # skip malformed lines
                listQ.append(textLine[0])
                listAns.append(textLine[1])
                if len(listQ) == batchSize:
                    lpQ = sequence.pad_sequences(qTkn.texts_to_sequences(listQ), maxlen=questionsL, padding='post')
                    lpAns = sequence.pad_sequences(ansTkn.texts_to_sequences(listAns), maxlen=answersL, padding='post')
                    yield (lpQ, encode_output(lpAns, answersVocabSize))
                    listQ = []
                    listAns = []
# end of routine


def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    # compile model
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model


def max_length(lines):
    return max(len(line.split()) for line in lines)


def tokenizePad(l1, l2):
    questionTokenizer = Tokenizer(num_words=60000)
    answerTokenizer = Tokenizer(num_words=60000)
    questionTokenizer.fit_on_texts(l1)
    answerTokenizer.fit_on_texts(l2)
    # The vocabulary sizes are capped to match num_words above.
    questionsVocabularySize = 60000
    answersVocabularySize = 60000
    questionsLength = max_length(l1)
    answersLength = max_length(l2)
    print("questions vocabulary size=" + str(questionsVocabularySize))
    print("questions maximum length=" + str(questionsLength))
    print("answers vocabulary size=" + str(answersVocabularySize))
    print("answers length=" + str(answersLength))
    with open('questionTokenizer.pkl', 'wb') as fpp:
        pickle.dump(questionTokenizer, fpp)
    with open('answerTokenizer.pkl', 'wb') as fpp1:
        pickle.dump(answerTokenizer, fpp1)
    batch_size = 32
    num_steps = 200
    spe = len(l1) // (batch_size * num_steps)
    epochs = 3
    # The last argument is the number of LSTM units; the batch size is passed here.
    model = define_model(questionsVocabularySize, answersVocabularySize, questionsLength, answersLength, batch_size)
    checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='min')
    model.fit_generator(
        generator=feedData('training.txt', batch_size, answersVocabularySize, questionsVocabularySize, questionsLength, answersLength),
        validation_data=feedData('testing.txt', batch_size, answersVocabularySize, questionsVocabularySize, questionsLength, answersLength),
        steps_per_epoch=spe,
        validation_steps=spe,
        epochs=epochs,
        callbacks=[checkpoint],
        verbose=2)


# load data sets
with open('questionsPure.pkl', 'rb') as fp:
    qs = pickle.load(fp)
with open('answersPure.pkl', 'rb') as fpp:
    ans = pickle.load(fpp)
tokenizePad(qs, ans)
print("program complete.")
I even wrote a data generator.
import numpy as np
from keras.utils import Sequence, to_categorical
from keras.preprocessing.sequence import pad_sequences


class DataGenerator(Sequence):
    def __init__(self, listQuestions, listAnswers, questionsLength, answersLength, answersVocabularySize, questionsVocabularySize, batchSize):
        self.questionSequences = listQuestions  # tokenized question sequences
        self.answerSequences = listAnswers  # tokenized answer sequences
        self.questionsLength = questionsLength  # length of question sequences
        self.answersLength = answersLength  # length of answer sequences
        self.answersVocabularySize = answersVocabularySize  # number of distinct words in answers
        self.questionsVocabularySize = questionsVocabularySize  # number of words in questions
        self.batchSize = batchSize  # number of question answer pairs in one batch
        self.listCounter = 0  # counter tracking how many sequences are done
        self.on_epoch_end()  # routine to set the counter to zero

    # one hot encode target sequence
    def encode_output(self, sequences, vocab_size):
        ylist = list()
        for seq in sequences:
            encoded = to_categorical(seq, num_classes=vocab_size)
            ylist.append(encoded)
        return np.array(ylist)

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.answerSequences) / self.batchSize))

    def __data_generation(self, index):
        # Slice out the pairs that belong to batch number `index`.
        start = index * self.batchSize
        end = start + self.batchSize
        x = pad_sequences(self.questionSequences[start:end], maxlen=self.questionsLength, padding='post')
        y = pad_sequences(self.answerSequences[start:end], maxlen=self.answersLength, padding='post')
        # Pad the integer ids first, then one-hot encode the padded answers.
        return (x, self.encode_output(y, self.answersVocabularySize))

    def on_epoch_end(self):
        self.listCounter = 0

    def __getitem__(self, index):
        # Keras passes the batch index to __getitem__.
        return self.__data_generation(index)
# end of class
See the below code that brings all this together.
Note:
I had discovered the advantages of pandas by this time particularly in handling large data sets in memory.
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint


def splitFrames(df, headSize):
    hd = df.head(headSize)
    tl = df.tail(len(df) - headSize)
    return hd, tl


def logData(logStr):
    with open('robo_train.log', 'a', encoding="utf8", errors="surrogateescape") as f:
        f.write(logStr + "\n")


def max_length(lines):
    return max(len(line.split()) for line in lines)


# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    # compile model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model


def feedData(qTokenizer, ansTokenizer, qVocabularySize, ansVocabularySize, qLength, ansLength, batchSize):
    rowCounter = 0
    totalRows = len(df_train.index)
    while True:
        x_list = np.empty((0, qLength))
        y_list = np.empty((0, ansLength, 1))
        for lc in range(batchSize):
            logData("round " + str(lc))
            logData("q vocabulary size within generator:" + str(qVocabularySize))
            logData("Ans vocabulary size within generator:" + str(ansVocabularySize))
            logData("q max seq length in generator:" + str(qLength))
            logData("ans max seq length in generator:" + str(ansLength))
            # .iloc selects by position; the old .ix indexer has been removed from pandas.
            questionsCellVal = df_train.iloc[rowCounter:rowCounter + 1]["Questions"].tolist()
            answersCellVal = df_train.iloc[rowCounter:rowCounter + 1]["Answers"].tolist()
            questionSequences = qTokenizer.texts_to_sequences(questionsCellVal)
            answerSequences = ansTokenizer.texts_to_sequences(answersCellVal)
            questionsPadded = pad_sequences(questionSequences, qLength, padding='post')
            answersPadded = pad_sequences(answerSequences, ansLength, padding='post')
            y = np.array(answersPadded)
            y_train = np.expand_dims(y, axis=-1)
            x_train = np.array(questionsPadded)
            logData("shape of input:" + str(x_train.shape))
            logData("shape of output:" + str(y_train.shape))
            x_list = np.concatenate((x_list, x_train))
            y_list = np.concatenate((y_list, y_train))
            logData("x list shape:" + str(x_list.shape))
            logData("y list shape:" + str(y_list.shape))
            rowCounter = rowCounter + 1
            if rowCounter >= totalRows:
                rowCounter = 0  # wrap around at the end of the data
        yield (x_list, y_list)


# load the data
dfSource = pd.read_csv("qa_prime.csv", engine="python")
dfSource.drop_duplicates(subset='Questions', inplace=True)
dfSource.drop_duplicates(subset='Answers', inplace=True)
df_train, df_validate = splitFrames(dfSource, 2000)
questionTokenizer = Tokenizer()
questionTokenizer.fit_on_texts(df_train["Questions"])
answerTokenizer = Tokenizer()
answerTokenizer.fit_on_texts(df_train["Answers"])
maxQ = max_length(df_train["Questions"].tolist())
maxAns = max_length(df_train["Answers"].tolist())
questionsVocabularySize = len(questionTokenizer.word_index) + 1
answersVocabularySize = len(answerTokenizer.word_index) + 1
btch_size = 3
spe = int(len(df_train.index) / btch_size)
logData("steps per epoch:" + str(spe))
# The last argument is the number of LSTM units; the batch size is passed here.
model = define_model(questionsVocabularySize, answersVocabularySize, maxQ, maxAns, btch_size)
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
#model.fit(questionsPadded, y_train, epochs=30, batch_size=32,validation_split=0.3, shuffle=True, callbacks=[checkpoint], verbose=2)
model.fit_generator(
    feedData(questionTokenizer, answerTokenizer, questionsVocabularySize, answersVocabularySize, maxQ, maxAns, btch_size),
    epochs=100,
    verbose=2,
    steps_per_epoch=spe,
    use_multiprocessing=True,
    workers=6,
    callbacks=[checkpoint])
model.save('model.h5')
If you try the above code, be careful with the shape of the arrays that go into the x and y variables. The x variable contains the question sequences while the y variable contains the answer sequences.
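The shapes can be checked without Keras. The sketch below uses plain Python lists and invented token ids. `pad_post` and `shape` are hypothetical helpers written for this illustration, and `pad_post` only mimics right-padding, not every option of the Keras padding function. With the sparse categorical cross-entropy loss used above, x should come out as (batch, question length) and y as (batch, answer length, 1).

```python
def pad_post(seq, maxlen, value=0):
    """Truncate or right-pad a list of token ids to exactly maxlen items."""
    return (list(seq) + [value] * maxlen)[:maxlen]


def shape(nested):
    """Return the dimensions of a regular, non-empty nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Invented token ids for a batch of two question answer pairs.
questions = [[4, 9, 2], [7, 1]]
answers = [[5, 3], [8, 6, 2, 1]]
q_len, ans_len = 4, 5

x = [pad_post(q, q_len) for q in questions]
# Each target id is wrapped in its own list, which gives the trailing axis of size 1.
y = [[[tok] for tok in pad_post(a, ans_len)] for a in answers]

print(shape(x))  # (2, 4)
print(shape(y))  # (2, 5, 1)
```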
Here is the code I used to make predictions on the model once it had been trained. The crucial point to remember here is that you need the same vocabulary and pre-processing steps for text as you did when creating the model so keep your code as modular as possible.
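The decoding half of a prediction can be sketched without Keras. This is an illustration under assumptions, not the prediction script referred to above. It assumes the model returns one row of probabilities per timestep, that id 0 is the padding slot, and that `index_word` is the id to word dictionary of the answer tokenizer. `decode_sequence` and the sample numbers are invented for this example.

```python
def decode_sequence(prob_rows, index_word):
    """Pick the most likely word id at each timestep and map the ids back to words."""
    words = []
    for row in prob_rows:
        best_id = max(range(len(row)), key=row.__getitem__)
        if best_id == 0:  # id 0 is assumed to be padding, so it carries no word
            continue
        word = index_word.get(best_id)
        if word is not None:
            words.append(word)
    return " ".join(words)


# Invented vocabulary and model output for three timesteps.
index_word = {1: "use", 2: "strong", 3: "passwords"}
probs = [
    [0.1, 0.7, 0.1, 0.1],
    [0.0, 0.1, 0.8, 0.1],
    [0.9, 0.0, 0.1, 0.0],
]
print(decode_sequence(probs, index_word))  # use strong
```

Keeping this step in its own function makes it easier to reuse the same tokenizers for training and for prediction.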
Other question answer frameworks
I then tried other frameworks like
cdqa
If I remember correctly, I did not make any significant progress because of version issues with Python. Yes, I was using virtual environments.
Note:
At the time of this writing, this repository is no longer being maintained. I have not tried the new system they refer to in their README file.
I then gave the DeepPavlov framework a go. I managed to get the demonstration working but training on my own questions was a challenge on my computer. That was last year and things may well have changed now.
Back to basics, the manual approach
I left the project alone for some time. I was looking at some old e-mail when I found an exchange between me and Adam Geitgey where I had asked him about summarizing text. I had also read an article on Medium about using document similarity to build a chatbot. I decided to try this approach.
The approach is as follows.
- Get a database of questions and answers.
- Tokenize the questions and answers and, for good measure, lemmatize them too. This accounts for different inflected forms of the same word.
- Capture the question from the user and subject it to the same pre-processing as the questions in the database.
- Stream the question to the database and do the similarity comparison.
- For similar questions, pull the answers into a list.
- Summarize the answers.
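The retrieval part of these steps can be sketched with the standard library alone. This is a simplified stand-in. It uses bag-of-words cosine similarity instead of word vectors, a tiny hand-picked stop word list, no lemmatization, and it skips the summarizing step. The function names, the threshold and the two sample rows are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# A tiny, hand-picked stop word list for the example only.
STOP_WORDS = {"a", "an", "the", "is", "do", "i", "how", "what", "to", "of"}


def preprocess(text):
    """Lowercase, strip punctuation and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def answer(user_question, qa_pairs, threshold=0.5):
    """Collect the answers whose stored question resembles the user's question."""
    query = preprocess(user_question)
    return [a for q, a in qa_pairs if cosine(query, preprocess(q)) >= threshold]


# Invented two-row database.
qa_pairs = [
    ("How do I set a strong password?", "Use a long pass phrase."),
    ("What is a firewall?", "A device that filters network traffic."),
]
print(answer("how to set strong password", qa_pairs))
```

The user's question goes through the same `preprocess` function as the stored questions, which is the point of the third step in the list.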
I used the spaCy library to do the heavy lifting. It has intelligent tokenization and also mechanisms to compare document similarity.
import pandas as pd
import spacy
import pathlib
import multiprocessing
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words


class Bot:
    """Process the text and handle all the bot functions."""

    def __init__(self):
        """Initialize class level variables. Load the spaCy nlp object."""
        self.questionText = ""
        self.nlp = spacy.load("en_core_web_lg")
        # Disable named entity recognition because we do not need it
        self.nlp.disable_pipes('ner')

    def process_text(self, text):
        """Clean and tokenize text.

        This function is used to clean and tokenize both the text of the
        question and answers.
        """
        text = text.lower()
        doc = self.nlp(text)
        result = []
        for token in doc:
            if token.text in self.nlp.Defaults.stop_words:
                continue
            if token.is_punct:
                continue
            if token.lemma_ == '-PRON-':
                continue
            result.append(token.lemma_)
        return " ".join(result)

    def assembleAnswer(self, ansList):
        """Assemble the answer from the supplied raw answers.

        parameters:
        ansList - a list of answers retrieved by row numbers
        returns:
        a string containing the summarized answer ready for display.
        """
        # Join with a space so the last sentence of one answer does not
        # run into the first sentence of the next.
        answerString = " ".join(str(a) for a in ansList)
        answerSummary = ""
        LANGUAGE = "english"
        SENTENCES_COUNT = 10
        parser = PlaintextParser.from_string(answerString, Tokenizer(LANGUAGE))
        stemmer = Stemmer(LANGUAGE)
        summarizer = Summarizer(stemmer)
        summarizer.stop_words = get_stop_words(LANGUAGE)
        for sentence in summarizer(parser.document, SENTENCES_COUNT):
            answerSummary = answerSummary + str(sentence) + " "
        return answerSummary.strip()

    def runbot(self, userQuestionText):
        pth = pathlib.Path("cdqa_tokenized.csv")
        self.questionText = userQuestionText
        df = pd.read_csv(pth, engine='python')
        df.drop_duplicates(subset='title', inplace=True)
        qText = self.questionText
        cleanQ = self.process_text(qText)
        qDoc = self.nlp(cleanQ)
        # Begin filtering the questions for key words from the tokenized question
        questionWordList = cleanQ.split(" ")
        questionsTokensList = df['questiontokens'].tolist()
        targetQuestions = []
        for questionWords in questionsTokensList:
            for w in questionWordList:
                if str(questionWords).find(str(w)) >= 0:
                    targetQuestions.append(questionWords)
                    break  # one matching word is enough; avoid duplicates
        questionsList = []
        numWorkers = max(1, multiprocessing.cpu_count() - 1)
        for doc in self.nlp.pipe(targetQuestions, batch_size=128, n_process=numWorkers):
            sm = doc.similarity(qDoc)
            if sm > .8:
                questionsList.append(str(doc))
        filteredFrame = df[df['questiontokens'].isin(questionsList)]
        answer = self.assembleAnswer(filteredFrame['paragraphs'])
        df = None
        filteredFrame = None
        return answer


if __name__ == "__main__":
    bot = Bot()
    s = bot.runbot("how do i hack a firewall")
    print(s)
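The first problem I ran into was the speed. The bot took several seconds to analyze the questions. The above code contains the fix to the problem: it first narrows the candidates down to the questions that share a key word with the user's question and then runs the similarity computations in batches through nlp.pipe across several processes. I had initially tried looping through the questions in pandas and running the similarity computations. I even tried multi-threading but that did not help.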
The bot still takes about 4 or 5 seconds to answer the question. This may be a limitation of my hardware.
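Hardware aside, one possible contributor is work that gets repeated for every question. This is a guess from reading the code, not a measurement. `runbot` re-reads the CSV file on each call, and every new `Bot` instance reloads the spaCy model. A minimal sketch of loading such resources once and reusing them is shown below. `load_resources` is a hypothetical stand-in and the sleep only simulates the slow load.

```python
import functools
import time


@functools.lru_cache(maxsize=1)
def load_resources():
    """Stand-in for an expensive one-off load such as a language model or a CSV file."""
    time.sleep(0.2)  # simulate the slow part
    return {"model": "loaded", "rows": 3}


def timed_call():
    """Return how long one call to load_resources takes."""
    start = time.perf_counter()
    load_resources()
    return time.perf_counter() - start


first = timed_call()   # pays the loading cost
second = timed_call()  # served from the cache
print(f"first call: {first:.3f}s, second call: {second:.3f}s")
```

The same effect can be had by creating the bot object once at start-up and reusing it for every question.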
The user interface
I had an acceptable bot but how do I get it out to the rest of the world? I had initially tried DigitalOcean but that became expensive. I wanted to host the bot locally and did not want to have a complex user interface.
I had used remi and decided to go with it again because of its sheer simplicity.
See how easy it all is.
There is no need to mess with HTML.
import remi.gui as gui
from remi import start, App
import cisoBot


class CisoBotPrime(App):
    def __init__(self, *args):
        super(CisoBotPrime, self).__init__(*args)

    def main(self):
        container = gui.VBox(style={'margin': '0px auto'})
        self.lbl_question = gui.Label('Enter question:')
        self.tb = gui.TextInput(width=100, height=200)
        self.bt = gui.Button('Ask')
        self.bt.onclick.do(self.on_button_pressed)
        self.answerLabel = gui.Label("Answer:")
        container.append(self.lbl_question)
        container.append(self.tb)
        container.append(self.bt)
        container.append(self.answerLabel)
        return container

    # listener function
    def on_button_pressed(self, widget):
        res = ""
        self.answerLabel.set_text("")
        qst = self.tb.get_value()
        if len(qst) > 0:
            bot = cisoBot.Bot()
            res = bot.runbot(qst)
            self.tb.set_text("")
            if len(res) <= 0:
                res = "Please ask a question related to information security or reword your query."
        else:
            res = "Please specify a question."
        self.answerLabel.set_text(res)


if __name__ == "__main__":
    start(CisoBotPrime, port=21000, multiple_instance=True, start_browser=False, debug=True, address="0.0.0.0")
The interface worked but only on local host. I had to proxy it out of my development machine.
I could have opened remi up to the rest of the world but it needs protection against JavaScript related attacks. I needed a better proxy that could handle malformed links etc.
I tried nginx but there seems to be a problem with WebSockets, which is what remi uses.
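For reference, nginx does not pass the WebSocket upgrade headers to the backend unless it is told to. A fragment along the following lines is the usual way to enable that. It is untested against remi and is only a sketch. The backend address is assumed from the port used in the remi code above.

```nginx
location / {
    # Assumed backend: the remi server listening on port 21000.
    proxy_pass http://127.0.0.1:21000;
    # WebSocket upgrades need HTTP/1.1 and explicit Upgrade/Connection headers.
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}
```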
Enter HAProxy
I settled on HAProxy which worked out of the box.
See how simple the basic configuration is.
pranav@machinelearning:~/manualCISOBot$ cat hc.conf
global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend http-in
    bind *:24000
    default_backend servers

backend servers
    server server1 127.0.0.1:21000 maxconn 32
Publishing the bot
I did not want the hassle of port forwarding and creating an additional entry point into my home network. The answer was tunneling. I used ngrok to expose the bot via a tunnel to the rest of the world.