Building a Telegram Bot to Detect Cyberbullying: A Python Machine Learning Tutorial
Learn how to build a machine learning model to detect cyberbullying in group chats and how to deploy it as a Telegram bot.
Bots (short for robots) are software applications or scripts that perform automated tasks. These tasks can range from simple, repetitive actions to complex, intelligent interactions. Bots can be designed to operate on various platforms, including websites, messaging apps, social media platforms, and more.
In this article, we'll be building a bot on a social platform (Telegram) to manage cyberbullying. Telegram is a messaging and social media platform that allows users to send messages, interact with groups and channels, make voice and video calls and a lot more. The bot will have access to messages in groups and be able to remove group members who engage in cyberbullying. The aim is to regulate conversations and help group admins manage their groups. An example of the bot can be found here.
Prerequisites
To follow this tutorial, you need the following:
Good knowledge of Python programming.
Basic understanding of Machine Learning algorithms.
Python installed on your computer.
A text editor - Visual Studio Code preferably.
Libraries Used
In this tutorial, you'll use several Python libraries and packages, the major ones being python-telegram-bot, scikit-learn, SQLAlchemy, pandas, numpy and nltk. python-telegram-bot is a Python wrapper for the Telegram Bot API that makes it easier to write Python code that communicates with the API. SQLAlchemy will be used to create a simple database that tracks the members of group chats and the number of cyberbullying-related messages they've sent. pandas, numpy, nltk and scikit-learn will be used to build a machine learning model that predicts whether a message is cyberbullying-related or not.
Install packages
To install the packages, run the below line in your terminal:
pip install nltk pandas numpy sqlalchemy scikit-learn python-telegram-bot
Building the telegram bot
The python-telegram-bot package has several high-level classes that make developing bots easy. It has two major submodules: the pure Python telegram module (which you can also use on its own to fetch updates and send messages) and the telegram.ext module, which has many built-in objects and classes that take work off your shoulders.
To build the telegram bot, the steps you'll follow are listed below:
Create the bot and get the API Token
First, you have to create a bot with BotFather. BotFather is the father of all bots on Telegram; with it, you can create and manage your Telegram bots and their tokens.
You can communicate with BotFather by searching for "BotFather" on Telegram.
When you click "BotFather", a chat opens where you can communicate with the bot. Click Start.
Next, type "/newbot" to create a new bot. BotFather will ask you for a name and a unique username for your bot.
You'll then receive a congratulatory message confirming that you have successfully created a Telegram bot, along with a link to your bot and the bot token. The token is like a password that you use when telling your bot to perform an instruction. Keep your token secret, because anyone who has it can control your bot.
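One common way to keep the token out of your source code is to read it from an environment variable. The sketch below is only an illustration: the variable name TELEGRAM_BOT_TOKEN and the load_token helper are assumptions made for this example, not something Telegram or python-telegram-bot requires.

```python
import os


def load_token(env_var: str = "TELEGRAM_BOT_TOKEN") -> str:
    """Return the bot token stored in an environment variable.

    The variable name is an assumption for this sketch; pick any name you like.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of starting a bot without a token
        raise RuntimeError(f"Set the {env_var} environment variable to your bot token")
    return token
```

With this in place, you could write TOKEN = load_token() in bot.py instead of pasting the token into the file, so the token never ends up in version control.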
Building with python-telegram-bot
To work with the python-telegram-bot package, there are some basic classes you should be familiar with:
The Application class is responsible for fetching updates from a queue called the update_queue. Another class, the Updater, fetches updates from Telegram and stores them in this queue. You then write functions, called handlers, to handle updates of different types and add them to your Application.
There are many types of handlers, but this tutorial uses only command handlers and message handlers. Command handlers handle commands, which are messages that begin with "/"; for example, the "/newbot" you sent to BotFather was a command. Message handlers handle ordinary messages.
Handler functions accept two parameters, update and context. The update is an object that contains all the information and data coming from Telegram, and the context is another object that contains information and data about the status of the library itself (such as the Bot and the Application).
Filters are another important concept. As the name implies, they restrict the kinds of updates a handler will handle.
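To build some intuition for how filters combine, here is a toy model written in plain Python. It is not how python-telegram-bot implements filters; ToyFilter, the message dictionaries and the chat_type values are all made up for this sketch. It only shows the idea behind the & (and) and ~ (not) operators that the real filters support, as in filters.TEXT & filters.ChatType.GROUPS.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyFilter:
    """A toy stand-in for a filter: wraps a yes/no check on a message dict."""

    check: Callable[[dict], bool]

    def __and__(self, other: "ToyFilter") -> "ToyFilter":
        # Both filters must accept the message
        return ToyFilter(lambda msg: self.check(msg) and other.check(msg))

    def __invert__(self) -> "ToyFilter":
        # Accept exactly the messages the original filter rejects
        return ToyFilter(lambda msg: not self.check(msg))


# Hypothetical message shape used only in this sketch
TEXT = ToyFilter(lambda msg: isinstance(msg.get("text"), str))
GROUPS = ToyFilter(lambda msg: msg.get("chat_type") in {"group", "supergroup"})

group_text = TEXT & GROUPS  # text messages sent in group chats
not_groups = ~GROUPS        # anything that does not come from a group chat

group_message = {"text": "hello", "chat_type": "group"}
private_message = {"text": "/start", "chat_type": "private"}

print(group_text.check(group_message))    # True
print(group_text.check(private_message))  # False
print(not_groups.check(private_message))  # True
```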
With these concepts out of the way, you can go ahead and write the code for your bot.
Create a folder for your project and a Python file with the name bot.py.
In the bot.py file, import the necessary packages and create a string for the bot token.
#import the necessary packages
from telegram import Update
from telegram.ext import filters, CommandHandler, MessageHandler,ContextTypes,Application
TOKEN="YOUR BOT TOKEN"
Next, add a logging system.
Logging is a crucial tool for debugging, monitoring, and maintaining applications.
#logging
import logging
logging.basicConfig(
format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO
)
Create the start handler function
Create a function to handle the "/start" command. When a user opens the bot and types the "/start" command, you'll want the bot to tell the user basic information about your bot and what it does.
#start command
async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Give a simple explanation of what the bot does"""
    await context.bot.send_message(
        chat_id=update.effective_chat.id,
        text="This is a cyberbullying bot. I will help you remove users who engage in cyberbullying in your group chats.",
    )
Create an application and add the start handler to your application.
Create an application and a command handler for the start function, add the handler to your application and run the application using the .run_polling() function. The filters=~filters.ChatType.GROUPS parameter specifies that you don't want this handler to handle updates from group chats.
if __name__ == "__main__":
    application = Application.builder().token(TOKEN).build()
    start_handler = CommandHandler("start", start, filters=~filters.ChatType.GROUPS)
    application.add_handler(start_handler)
    application.run_polling()
Register the start command with BotFather.
Telegram doesn't know yet that your bot accepts the "/start" command, so it's worth registering that command with BotFather (although this is not compulsory). Registering a command with BotFather creates a menu that lists the commands your bot recognizes. You can do this with the "/setcommands" command.
- Test your bot by running the Python file and confirm that the start command works.
Add the cyber-bullying messages handler
The cyberbullying handler accepts messages sent in group chats and checks whether each message is cyberbullying-related. If it is, the handler adds the sender's user_id and the group's groupchat_id to a database, along with a no_bullying value that counts the number of cyberbullying messages the user has sent to that group chat. When no_bullying reaches a limit (3 for this project), the user is banned from the group for some days (3 days for this project).
In the bot.py file, create a simple function to identify cyberbullying.
Create a simple function to identify whether a text is cyberbullying-related or not. The function checks whether "fool" is in the text; if so, the text is treated as cyberbullying-related.
Note: This is a very inaccurate way of telling whether a text is cyberbullying-related. You'll build a more accurate one using machine learning techniques later in the tutorial.
def is_cyberbullying(text):
    if "fool" in text:
        return True
    return False
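To see why the substring check is so inaccurate, the sketch below compares it with a slightly stricter keyword check. This is only an illustration and still far from a real classifier; FLAGGED_WORDS and is_cyberbullying_keyword are hypothetical names introduced here.

```python
import re

# Hypothetical word list for this sketch; the tutorial's function only checks for "fool"
FLAGGED_WORDS = ("fool",)
_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, FLAGGED_WORDS)) + r")\b",
    re.IGNORECASE,
)


def is_cyberbullying_keyword(text: str) -> bool:
    """Return True if a flagged word appears as a whole word, ignoring case."""
    return _PATTERN.search(text) is not None


# The plain substring check is case-sensitive and matches inside other words
print("fool" in "You FOOL")            # False
print("fool" in "that was foolproof")  # True

# The whole-word, case-insensitive check handles both of these cases
print(is_cyberbullying_keyword("You FOOL"))            # True
print(is_cyberbullying_keyword("that was foolproof"))  # False
```

Even this version only knows the words you list by hand, which is why the tutorial moves on to a trained model.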
Create a temporary database.
After creating the is_cyberbullying function, you'll want a database to store the user_id, the groupchat_id and the number of bullying messages sent by a user in a group. For now, a list of dictionaries can serve as a temporary database (it is cleared when you stop running the Python file). Later in the tutorial, you'll learn how to create a real database.
#a list of dictionaries in the format {"user_id": 1, "chat_id": 1, "no_bullying": 0}
db=[]
Create helper functions to communicate with the temporary database.
The functions are:
A function to add a user_id and groupchat_id to the database anytime we spot a cyberbullying-related message.
A function to check if a user has hit the number of cyber-bullying messages limit for a group chat.
A function to reset the number of cyber-bullying messages a user has sent to zero.
The functions are shown below :
#maximum number of bullying_messages
MAX_BULLYING_MESSAGES=3
NO_BANNED_DAYS=3
def add_to_db(user_id,chat_id):
    #search for the user record in the group chat
    for i,user_record in enumerate(db):
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            #if user exists, increase the number of bullying message the user has sent by 1
            user_record["no_bullying"]+=1
            db[i]=user_record
            return
    #if user doesn't exist, create a new user to the database
    db.append({"user_id":user_id,"chat_id":chat_id,"no_bullying":1})
    return
def has_hit_limit(user_id,chat_id):
    for user_record in db:
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            if user_record["no_bullying"]==MAX_BULLYING_MESSAGES:
                return True
    return False
def reset_user_record(user_id,chat_id):
    for i,user_record in enumerate(db):
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            user_record["no_bullying"]=0
            db[i]=user_record
            return
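The same bookkeeping can also be sketched with a dictionary keyed by the (user_id, chat_id) pair, which avoids scanning a list on every message. This is an alternative design shown for illustration only; the names strikes, add_strike, limit_reached and reset_strikes are hypothetical and are not used elsewhere in the tutorial.

```python
from collections import defaultdict

MAX_BULLYING_MESSAGES = 3

# Maps (user_id, chat_id) -> number of flagged messages; missing keys count as 0
strikes = defaultdict(int)


def add_strike(user_id, chat_id):
    """Record one flagged message and return the new count."""
    strikes[(user_id, chat_id)] += 1
    return strikes[(user_id, chat_id)]


def limit_reached(user_id, chat_id):
    """Return True once the user has reached the limit in this chat."""
    return strikes.get((user_id, chat_id), 0) >= MAX_BULLYING_MESSAGES


def reset_strikes(user_id, chat_id):
    """Forget the user's flagged messages for this chat."""
    strikes.pop((user_id, chat_id), None)


# Walk through three flagged messages from user 42 in chat -100
add_strike(42, -100)
add_strike(42, -100)
print(limit_reached(42, -100))  # False
add_strike(42, -100)
print(limit_reached(42, -100))  # True
reset_strikes(42, -100)
print(limit_reached(42, -100))  # False
```

Counts are tracked per chat, so the same user can be at the limit in one group and have a clean record in another.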
Build the cyberbullying handler function.
The cyberbullying handler function checks whether a message sent to a group chat is cyberbullying-related. If it is, the function adds the user_id and the groupchat_id to the database and sends a warning to the user. If the user has hit MAX_BULLYING_MESSAGES, the function instead bans that user from the group chat for 3 days (NO_BANNED_DAYS=3) and resets the user's count of cyberbullying messages to zero.
from datetime import datetime, timedelta
#cyberbullying handler
async def remove_cyberbullying(update: Update, context: ContextTypes.DEFAULT_TYPE):
    chat_id=update.effective_chat.id
    message=update.message
    sender_id=message.from_user.id
    if is_cyberbullying(message.text):
        add_to_db(sender_id,chat_id)
        #if user has hit limit
        if has_hit_limit(sender_id,chat_id):
            #current date
            current_date=datetime.now()
            ban_duration=timedelta(days=NO_BANNED_DAYS)
            unban_date=current_date+ban_duration
            #reset record
            reset_user_record(sender_id,chat_id)
            #remove user
            await context.bot.send_message(chat_id=update.effective_chat.id, text=f"The message you've sent is abusive and you've exceeded the cyberbullying limit, so you'll be banned from the group chat for {NO_BANNED_DAYS} days!",reply_to_message_id=message.message_id)
            await context.bot.ban_chat_member(chat_id=chat_id,user_id=sender_id, revoke_messages=False, until_date=unban_date)
        else:
            #send a message that the person has sent an abusive message
            await context.bot.send_message(chat_id=update.effective_chat.id, text="The message you've sent is abusive, be careful or you'll be removed from the group chat soon!",reply_to_message_id=message.message_id)
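One detail worth checking is the timezone of the unban date. datetime.now() returns a naive local time, and as far as I know python-telegram-bot interprets naive datetimes in the bot's default timezone (UTC unless configured otherwise), so the ban could end a few hours early or late on a machine that is not set to UTC. A minimal sketch of a timezone-aware calculation follows; compute_unban_date is a hypothetical helper, not part of the tutorial's bot.py.

```python
from datetime import datetime, timedelta, timezone

NO_BANNED_DAYS = 3


def compute_unban_date(now=None, days=NO_BANNED_DAYS):
    """Return a timezone-aware UTC datetime `days` after `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now + timedelta(days=days)


# A fixed starting point makes the result easy to check by hand
start = datetime(2024, 1, 30, 12, 0, tzinfo=timezone.utc)
print(compute_unban_date(start).isoformat())  # 2024-02-02T12:00:00+00:00
```

The returned value could be passed as until_date in place of unban_date.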
- Finally, create a cyberbullying handler, then add it to the application and run the Python file again to test the bot.
if __name__ == "__main__":
    application = Application.builder().token(TOKEN).build()
    start_handler = CommandHandler("start", start, filters=~filters.ChatType.GROUPS)
    cyberbullying_handler = MessageHandler(filters.TEXT & filters.ChatType.GROUPS , remove_cyberbullying) #filters.TEXT & filters.ChatType.GROUPS tells us we want only text messages from group chats
    application.add_handler(start_handler)
    application.add_handler(cyberbullying_handler)
    application.run_polling()
The full bot.py file (with the start function slightly updated) becomes:
#import the necessary packages
import logging
from datetime import datetime, timedelta
from telegram import Update
from telegram.ext import filters, CommandHandler, MessageHandler,ContextTypes,Application
TOKEN="YOUR BOT TOKEN"
MAX_BULLYING_MESSAGES=3
NO_BANNED_DAYS=3
#a list of dictionaries in the format {"user_id": 1, "chat_id": 1, "no_bullying": 0}
db=[]
logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO
)
def is_cyberbullying(text):
    if "fool" in text:
        return True
    return False
def add_to_db(user_id,chat_id):
    #search for the user record in the group chat
    for i,user_record in enumerate(db):
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            #if user exists, increase the number of bullying message the user has sent by 1
            user_record["no_bullying"]+=1
            db[i]=user_record
            return
    #if user doesn't exist, create a new user to the database
    db.append({"user_id":user_id,"chat_id":chat_id,"no_bullying":1})
    return
def has_hit_limit(user_id,chat_id):
    for user_record in db:
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            if user_record["no_bullying"]==MAX_BULLYING_MESSAGES:
                return True
    return False
def reset_user_record(user_id,chat_id):
    for i,user_record in enumerate(db):
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            user_record["no_bullying"]=0
            db[i]=user_record
            return
#start command
async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Give a simple explanation of what the bot does"""
    await context.bot.send_message(
        chat_id=update.effective_chat.id, text=f"This is a cyberbullying bot. I will help you remove users who engage in cyberbullying in your group chats.\nHow to use?\n\nAdd me as an admin to your group chat and give me permission to remove users and to send and delete messages.\nUsers are given {MAX_BULLYING_MESSAGES} opportunities. If a user sends {MAX_BULLYING_MESSAGES} messages recognized as cyberbullying, the user will be removed and banned for {NO_BANNED_DAYS} days!"
    )
#cyberbullying handler
async def remove_cyberbullying(update: Update, context: ContextTypes.DEFAULT_TYPE):
    chat_id=update.effective_chat.id
    message=update.message
    sender_id=message.from_user.id
    if is_cyberbullying(message.text):
        add_to_db(sender_id,chat_id)
        #if user has hit limit
        if has_hit_limit(sender_id,chat_id):
            #current date
            current_date=datetime.now()
            ban_duration=timedelta(days=NO_BANNED_DAYS)
            unban_date=current_date+ban_duration
            #reset record
            reset_user_record(sender_id,chat_id)
            #remove user
            await context.bot.send_message(chat_id=update.effective_chat.id, text=f"The message you've sent is abusive and you've exceeded the cyberbullying limit, so you'll be banned from the group chat for {NO_BANNED_DAYS} days!",reply_to_message_id=message.message_id)
            await context.bot.ban_chat_member(chat_id=chat_id,user_id=sender_id, revoke_messages=False, until_date=unban_date)
        else:
            #send a message that the person has sent an abusive message
            await context.bot.send_message(chat_id=update.effective_chat.id, text="The message you've sent is abusive, be careful or you'll be removed from the group chat soon!",reply_to_message_id=message.message_id)
if __name__ == "__main__":
    application = Application.builder().token(TOKEN).build()
    start_handler = CommandHandler("start", start, filters=~filters.ChatType.GROUPS)
    cyberbullying_handler = MessageHandler(filters.TEXT & filters.ChatType.GROUPS , remove_cyberbullying) #filters.TEXT & filters.ChatType.GROUPS tells us we want only text messages from group chats
    application.add_handler(start_handler)
    application.add_handler(cyberbullying_handler)
    application.run_polling()
The telegram bot logic is complete now and you can test it by adding it to a group, ensuring you make the bot an admin and give it all admin permissions.
Note: The bot cannot remove the owner of a group chat!
Build a machine-learning model to identify cyberbullying messages
In this section, you'll replace the is_cyberbullying() function with a machine-learning text classifier, which will identify cyberbullying messages better than the simple function used in the previous section. So let's dive in.
Data
To build a machine learning model to identify whether a text is cyberbullying or not, you'll need data on cyberbullying and non-cyberbullying texts. Kaggle is a good platform where you can find data on several projects. Two datasets on Kaggle were found to be useful for this project, the kaggle_parsed_dataset.csv from this Cyberbullying Dataset page and also a JSON file from this Tweets Dataset for Detection of Cyber-Trolls page on Kaggle.
Download both datasets, create a new folder "data" in your project folder and save both datasets in the folder.
Create a new Python file, "model.py". This file will be used to clean the data and build the model.
Read the datasets.
In the "model.py" file, load the datasets using the pandas .read_csv() and .read_json() functions.
import pandas as pd
#read the dataset
dataset_1=pd.read_json("data/Dataset for Detection of Cyber-Trolls.json",lines=True)
dataset_2=pd.read_csv("data/kaggle_parsed_dataset.csv")
Clean the first dataset (dataset_1).
Rename the columns and drop the unused one so that only the text and label remain, then extract the numeric label from the label column.
#CLEAN DATASET 1
#rename columns
dataset_1.columns=["text","label","extras"]
#drop unused column
dataset_1=dataset_1.drop("extras",axis=1)
def extract_label(label):
    """ Original label is in the format {'notes': '', 'label': ['1']}"""
    return int(label["label"][0])
dataset_1["label"]=dataset_1["label"].apply(extract_label)
Clean the second dataset (dataset_2).
Rename the columns and remove the starting double quotes from the text column.
#CLEAN DATASET 2
required_cols=["Text","oh_label"]
dataset_2=dataset_2[required_cols]
dataset_2.columns=["text","label"]
def remove_quotes(text):
    """all the strings in the csv file have a double quote "" starting them, let's remove them """
    return text[1:len(text)-2]
dataset_2["text"]=dataset_2["text"].apply(remove_quotes)
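It is worth checking what this slice actually removes before relying on it. The sketch below runs the same slice on a made-up sample string and compares it with a hypothetical alternative, strip_wrapping_quotes, that only removes double quotes at the ends. Which one fits depends on how the values in the CSV file really look, so inspect a few rows of your copy of the dataset first.

```python
def remove_quotes_slice(text):
    """The slice used in the tutorial: drops the first character and the last two."""
    return text[1:len(text) - 2]


def strip_wrapping_quotes(text):
    """Hypothetical alternative: remove any double quotes at either end."""
    return text.strip('"')


# A made-up sample wrapped in exactly one pair of double quotes
sample = '"hello world"'
print(remove_quotes_slice(sample))    # hello worl
print(strip_wrapping_quotes(sample))  # hello world
```

If a value is wrapped in exactly one pair of quotes, the slice also drops the last real character; if a value ends with two extra characters, the slice is the right choice.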
Concatenate the datasets.
Combine dataset_1 and dataset_2 to form a single dataset.
#CONCATENATE THE DATASETS
all_data=pd.concat([dataset_1,dataset_2])
- Write a function that takes in a text and removes all usernames, URLs, punctuation, numbers and stop words from the text.
import re
import nltk
import string
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
#functions to clean the text
def clean_text(text):
    """ Clean the text"""
    # Lowering letters
    text = text.lower()
    # Removing emails & twitter usernames
    text = re.sub(r'\S*@\S*', '', text)
    # Removing urls (\S* matches any run of non-whitespace chars)
    text = re.sub(r'http\S*', '', text)
    # Replacing everything that is not a letter (numbers, symbols) with a space
    text = re.sub('[^a-zA-Z]',' ',text)
    for punctuation in string.punctuation:
        text=text.replace(punctuation,"")
    # Tokenizing, which also collapses repeated whitespace
    word_tokens = word_tokenize(text)
    #remove all stop words
    stopwords=nltk.corpus.stopwords.words("english")
    new_word_tokens=[]
    for token in word_tokens:
        if token not in stopwords:
            new_word_tokens.append(token)
    #join the tokens that survived stop-word removal
    return ' '.join(new_word_tokens)
all_data["text"]=all_data["text"].apply(clean_text)
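To see the effect of each cleaning step without downloading any NLTK data, here is a simplified sketch that uses only the standard library. It is not the tutorial's clean_text: it tokenizes by splitting on whitespace instead of calling word_tokenize, and STOP_WORDS is a tiny hand-picked set assumed for this example, whereas NLTK's English list is much longer.

```python
import re

# A tiny hand-picked stop-word set, assumed for this sketch only
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "to", "and", "at"}


def clean_text_sketch(text):
    """Apply the same kind of cleaning steps using only the standard library."""
    text = text.lower()
    text = re.sub(r"\S*@\S*", "", text)  # drop emails and @usernames
    text = re.sub(r"http\S*", "", text)  # drop URLs
    text = re.sub(r"[^a-z]", " ", text)  # replace digits, punctuation and symbols with spaces
    tokens = text.split()                # whitespace tokenization
    return " ".join(token for token in tokens if token not in STOP_WORDS)


print(clean_text_sketch("@sam You are the WORST!!! see http://example.com 123"))
# worst see
```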
- Remove all missing values and duplicate rows.
#remove all_duplicates
all_data=all_data.drop_duplicates(subset="text")
#remove missing values
all_data=all_data.dropna(subset=["text"])
- Separate the label and text columns.
#separate target and text columns
target=all_data["label"]
text=all_data["text"]
- Separate the dataset into an X dataset to train the model on and a val dataset to test the model on.
#Import necessary Sklearn functions and classes
from sklearn.model_selection import train_test_split
X,val,y,y_val=train_test_split(text,target,test_size=0.15,random_state=0)
- You can write all the code to load and clean the data into a function called clean_data() to make your code neater.
def clean_data():
    #load the dataset
    dataset_1=pd.read_json("data/Dataset for Detection of Cyber-Trolls.json",lines=True)
    dataset_2=pd.read_csv("data/kaggle_parsed_dataset.csv")
    #CLEAN DATASET 1
    #rename columns
    dataset_1.columns=["text","label","extras"]
    #drop unused column
    dataset_1=dataset_1.drop("extras",axis=1)
    dataset_1["label"]=dataset_1["label"].apply(extract_label)
    #CLEAN DATASET 2
    required_cols=["Text","oh_label"]
    dataset_2=dataset_2[required_cols]
    dataset_2.columns=["text","label"]
    dataset_2["text"]=dataset_2["text"].apply(remove_quotes)
    #CONCATENATE THE DATASETS
    all_data=pd.concat([dataset_1,dataset_2])
    all_data["text"]=all_data["text"].apply(clean_text)
    #remove all_duplicates
    all_data=all_data.drop_duplicates(subset="text")
    #remove missing values
    all_data=all_data.dropna(subset=["text"])
    #separate target and text columns
    target=all_data["label"]
    text=all_data["text"]
    X,val,y,y_val=train_test_split(text,target,test_size=0.15,random_state=0)
    return X, val, y, y_val
Model
Generally, machine learning models accept numerical data, so you'd need to convert the text data to numerical values first.
There are many methods of converting text to numerical values; popular ones include embeddings, count vectorization and TF-IDF.
After converting the text to numerical values (vectors), you'll then build a classifier on the numerical vectors. In this tutorial count vectorization and a linear classifier (logistic regression) will be used.
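To make count vectorization concrete, here is a toy version in plain Python that counts single words and pairs of adjacent words (unigrams and bigrams). It is a rough sketch of the idea only, not scikit-learn's implementation: CountVectorizer has its own tokenizer and returns a sparse matrix. The helper names and the two sample sentences are made up for this example.

```python
from collections import Counter


def extract_terms(text, ngram_range=(1, 2)):
    """Split on whitespace and return every n-gram in the given range."""
    tokens = text.split()
    low, high = ngram_range
    terms = []
    for n in range(low, high + 1):
        terms += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms


def build_vocabulary(docs, min_df=1):
    """Keep the terms that appear in at least min_df documents, sorted alphabetically."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(extract_terms(doc)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


def vectorize(doc, vocabulary):
    """Return one count per vocabulary term."""
    counts = Counter(extract_terms(doc))
    return [counts[term] for term in vocabulary]


docs = ["you are so rude", "you are so kind"]
vocabulary = build_vocabulary(docs)
print(vocabulary)
# ['are', 'are so', 'kind', 'rude', 'so', 'so kind', 'so rude', 'you', 'you are']
print(vectorize(docs[0], vocabulary))  # [1, 1, 0, 1, 1, 0, 1, 1, 1]
print(build_vocabulary(docs, min_df=2))
# ['are', 'are so', 'so', 'you', 'you are']
```

The last call shows the effect of a minimum document frequency: terms that occur in only one document are dropped, which is the role min_df=2 plays in the tutorial's vectorizer.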
Add the below code in the "model.py" file.
from sklearn.metrics import accuracy_score,precision_score,recall_score,classification_report,f1_score,confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
X, val, y, y_val=clean_data()
vectorizer=CountVectorizer(min_df=2,ngram_range=(1,2))
X_vect=vectorizer.fit_transform(X)
val_vect=vectorizer.transform(val)
print(f"We have {len(vectorizer.vocabulary_)} words in the vocabulary")
#create a logistic regression model
model=LogisticRegression(max_iter=200)
#fit the model
print("FITTING THE MODEL!!! ")
model.fit(X_vect,y)
After building the model, you'll need to evaluate its performance on the validation dataset using classification metrics like f1_score, recall, and precision.
print("Model training complete\n EVALUATING THE MODELS PERFORMANCE ")
predictions=model.predict(val_vect)
f1_score_=f1_score(y_val,predictions)
recall_score_=recall_score(y_val,predictions)
precision_score_=precision_score(y_val,predictions)
print(f"F1 score : {f1_score_}")
print(f"Recall score : {recall_score_}")
print(f"Precision Score: {precision_score_}")
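If these metrics are new to you, the sketch below computes them by hand for a small made-up set of labels, where 1 means cyberbullying and 0 means not. It only illustrates the standard definitions; binary_metrics is a hypothetical helper, and in the tutorial scikit-learn does this work.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # bullying, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # harmless, but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # bullying, but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 2 false positives, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a moderation bot, precision tells you how often a flagged message really was abusive, while recall tells you how much of the abuse the model catches; a low precision means users get warned or banned unfairly.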
After training the model, you'll have to save the vectorizer and model objects to files so you can use them when making predictions. The pickle package is used for serializing and deserializing Python objects, so pickling the vectorizer and the model keeps them ready to load whenever you need them.
Create a new folder "assets" where you'd save the vectorizer and model files. Add the below code in the "model.py" file to save the model and vectorizer.
import pickle
#save the vectorizer and model
model_path="assets/model.pickle"
vectorizer_path="assets/vectorizer.pickle"
#use with-blocks so the files are closed once the objects are written
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(vectorizer_path, "wb") as f:
    pickle.dump(vectorizer, f)
Run your code and you'll notice that two new files have been added to your assets folder: "model.pickle" and "vectorizer.pickle".
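Here is a small, self-contained sketch of the save and load round trip, using a plain dictionary in place of the trained objects and a temporary folder in place of "assets". The helpers save_object and load_object are hypothetical names for this example.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Pickle obj to path, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object back from path."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# A dictionary stands in for the vectorizer or model in this sketch
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "assets" / "vocabulary.pickle"
    save_object({"fool": 0, "kind": 1}, target)
    restored = load_object(target)

print(restored)  # {'fool': 0, 'kind': 1}
```

Keep in mind that unpickling can run arbitrary code, so only load pickle files that you created yourself or that come from a source you trust.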
To make your code neater, write all the necessary lines to build the model into a function. The updated code then becomes:
import re
import nltk
import string
import pickle
import pandas as pd
import numpy as np
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
def extract_label(label):
    """ Original label is in the format {'notes': '', 'label': ['1']}"""
    return int(label["label"][0])
def remove_quotes(text):
    """all the strings in the csv file have a double quote "" starting them, let's remove them """
    return text[1:len(text)-2]
def clean_text(text):
    """ Clean the text"""
    # Lowering letters
    text = text.lower()
    # Removing emails & twitter usernames
    text = re.sub(r'\S*@\S*', '', text)
    # Removing urls (\S* matches any run of non-whitespace chars)
    text = re.sub(r'http\S*', '', text)
    # Replacing everything that is not a letter (numbers, symbols) with a space
    text = re.sub('[^a-zA-Z]',' ',text)
    for punctuation in string.punctuation:
        text=text.replace(punctuation,"")
    # Tokenizing, which also collapses repeated whitespace
    word_tokens = word_tokenize(text)
    #remove all stop words
    stopwords=nltk.corpus.stopwords.words("english")
    new_word_tokens=[]
    for token in word_tokens:
        if token not in stopwords:
            new_word_tokens.append(token)
    #join the tokens that survived stop-word removal
    return ' '.join(new_word_tokens)
#Import necessary Sklearn functions and classes
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,precision_score,recall_score,classification_report,f1_score,confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
def clean_data():
    #load the dataset
    dataset_1=pd.read_json("data/Dataset for Detection of Cyber-Trolls.json",lines=True)
    dataset_2=pd.read_csv("data/kaggle_parsed_dataset.csv")
    #CLEAN DATASET 1
    #rename columns
    dataset_1.columns=["text","label","extras"]
    #drop unused column
    dataset_1=dataset_1.drop("extras",axis=1)
    dataset_1["label"]=dataset_1["label"].apply(extract_label)
    #CLEAN DATASET 2
    required_cols=["Text","oh_label"]
    dataset_2=dataset_2[required_cols]
    dataset_2.columns=["text","label"]
    dataset_2["text"]=dataset_2["text"].apply(remove_quotes)
    #CONCATENATE THE DATASETS
    all_data=pd.concat([dataset_1,dataset_2])
    all_data["text"]=all_data["text"].apply(clean_text)
    #remove all_duplicates
    all_data=all_data.drop_duplicates(subset="text")
    #remove missing values
    all_data=all_data.dropna(subset=["text"])
    #separate target and text columns
    target=all_data["label"]
    text=all_data["text"]
    X,val,y,y_val=train_test_split(text,target,test_size=0.15,random_state=0)
    return X, val, y, y_val
def train_model(X,val,y,y_val):
    #create a bag of words model
    vectorizer=CountVectorizer(min_df=2,ngram_range=(1,2))
    X_vect=vectorizer.fit_transform(X)
    val_vect=vectorizer.transform(val)
    print(f"We have {len(vectorizer.vocabulary_)} words in the vocabulary")
    #create a logistic regression model
    model=LogisticRegression(max_iter=200)
    #fit the model
    print("FITTING THE MODEL!!! ")
    model.fit(X_vect,y)
    print("Model training complete\n EVALUATING THE MODELS PERFORMANCE ")
    predictions=model.predict(val_vect)
    f1_score_=f1_score(y_val,predictions)
    recall_score_=recall_score(y_val,predictions)
    precision_score_=precision_score(y_val,predictions)
    print(f"F1 score : {f1_score_}")
    print(f"Recall score : {recall_score_}")
    print(f"Precision Score: {precision_score_}")
    return model, vectorizer
if __name__ == "__main__":
    #clean the data and train the model only when this file is run directly.
    X, val, y, y_val=clean_data()
    model,vectorizer=train_model(X,val,y,y_val)
    model_path="assets/model.pickle"
    vectorizer_path="assets/vectorizer.pickle"
    #use with-blocks so the files are closed once the objects are written
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(vectorizer, f)
Update the is_cyberbullying function
Since you've built a machine-learning model to identify cyberbullying-related text, update the is_cyberbullying() function to use it.
In the bot.py file, update the is_cyberbullying function.
#import the clean_text function from the model.py file
#(since you'll want to apply the same cleaning to incoming
#messages that you applied to the training data).
import pickle
from model import clean_text
model_path="assets/model.pickle"
vectorizer_path="assets/vectorizer.pickle"
def is_cyberbullying(text):
    #read the saved vectorizer and model
    with open(vectorizer_path,'rb') as f:
        vectorizer = pickle.load(f)
    with open(model_path,'rb') as f:
        model = pickle.load(f)
    #clean the text
    text=clean_text(text)
    #convert the text to a vector and make a prediction
    prediction=model.predict(vectorizer.transform([text]))[0]
    if prediction==1:
        return True
    return False
Add a real database
The previous database you used was a simple Python list, which is temporary and is emptied as soon as you stop running the Python file. In this section, you'll learn how to add a real-world database using SQLAlchemy.
SQL (Structured Query Language) is a language used for managing and manipulating relational databases. It allows users to interact with databases to perform operations like querying data, inserting, updating, and deleting records, and creating or modifying database structures. SQLAlchemy is a Python library for managing and interacting with SQL databases using Python code.
How to use SQLAlchemy
Create a test_sql.py file in your project folder. You'll use this file to create a simple SQL database and learn how to perform CRUD (create, read, update and delete) operations with SQLAlchemy. Follow the steps below in the test_sql.py file.
- Importing SQLAlchemy
import sqlalchemy as db
- Create the Database and the GroupMembers Table (the table where you'll store the data).
engine = db.create_engine('sqlite:///test_database.sqlite')
#create a connection
conn = engine.connect()
#metadata
metadata = db.MetaData()
GroupMembers = db.Table('GroupMembers', metadata,
db.Column('Id', db.Integer(), primary_key=True),
db.Column('user_id', db.Integer),
db.Column('groupchat_id', db.Integer),
db.Column('no_bullying', db.Integer),
)
metadata.create_all(engine)
- Inserting Data (Create).
query = db.insert(GroupMembers).values(user_id=4427, groupchat_id=125,no_bullying=1)
Result = conn.execute(query)
query = db.insert(GroupMembers).values(user_id=2, groupchat_id=12,no_bullying=1)
Result = conn.execute(query)
- Reading data
#read all items in the table
output = conn.execute(GroupMembers.select()).fetchall()
print(output)
#search for specific items
query = GroupMembers.select().where(GroupMembers.columns.user_id==4427)
output = conn.execute(query)
print(output.fetchone())
#search for items with multiple conditions
query = GroupMembers.select().where(db.and_(GroupMembers.columns.user_id == 2, GroupMembers.columns.groupchat_id == 12))
output = conn.execute(query)
result=output.fetchone()
print(result)
- Updating data
# Get previous value of no_bullying
no_bullying = result.no_bullying
query = GroupMembers.update().where(
db.and_(GroupMembers.columns.user_id == result.user_id, GroupMembers.columns.groupchat_id == result.groupchat_id)
).values(no_bullying=no_bullying + 1)
conn.execute(query)
- Deleting data
query = GroupMembers.delete().where(
db.and_(GroupMembers.columns.user_id == 4427, GroupMembers.columns.groupchat_id == 125)
)
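# Execute the delete statement
conn.execute(query)
# SQLAlchemy 2.x does not autocommit, so commit to save the changes above to the file
conn.commit()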
That's it! You have learned how to use SQLAlchemy to create a simple database, define tables, perform CRUD operations, and manipulate data in the database.
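If you are curious what these operations look like as SQL, the sketch below performs roughly the same steps with Python's built-in sqlite3 module and an in-memory database. The SQL strings are hand-written approximations for illustration, not the exact statements SQLAlchemy generates.

```python
import sqlite3

# An in-memory database, so nothing is written to disk in this sketch
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GroupMembers ("
    "Id INTEGER PRIMARY KEY, user_id INTEGER, groupchat_id INTEGER, no_bullying INTEGER)"
)

# Create: insert two rows
insert_sql = "INSERT INTO GroupMembers (user_id, groupchat_id, no_bullying) VALUES (?, ?, ?)"
conn.execute(insert_sql, (4427, 125, 1))
conn.execute(insert_sql, (2, 12, 1))

# Read: search with multiple conditions
select_sql = (
    "SELECT user_id, groupchat_id, no_bullying FROM GroupMembers "
    "WHERE user_id = ? AND groupchat_id = ?"
)
row = conn.execute(select_sql, (2, 12)).fetchone()
print(row)  # (2, 12, 1)

# Update: increase no_bullying by 1 for that user in that group chat
conn.execute(
    "UPDATE GroupMembers SET no_bullying = no_bullying + 1 "
    "WHERE user_id = ? AND groupchat_id = ?",
    (2, 12),
)

# Delete: remove the other row
conn.execute("DELETE FROM GroupMembers WHERE user_id = ? AND groupchat_id = ?", (4427, 125))
conn.commit()

remaining = conn.execute("SELECT user_id, groupchat_id, no_bullying FROM GroupMembers").fetchall()
print(remaining)  # [(2, 12, 2)]
```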
Since you know how to perform basic operations already, it's time to create a database for the bot.
Building the database
Create a new Python file "database.py".
This file will contain all functions necessary for the bot to communicate with the database.
Create the GroupMembers table.
import sqlalchemy as db
#create an engine
engine = db.create_engine('sqlite:///database.sqlite')
#create a connection
conn = engine.connect()
#metadata
metadata = db.MetaData()
GroupMembers = db.Table('GroupMembers', metadata,
db.Column('Id', db.Integer(), primary_key=True),
db.Column('user_id', db.Integer),
db.Column('groupchat_id', db.Integer),
db.Column('no_bullying', db.Integer),
)
metadata.create_all(engine)
- Using SQLAlchemy, rewrite the functions you wrote earlier: one to add a user_id and groupchat_id to the database, one to check whether a user has hit the cyberbullying limit, and one to reset the number of bullying messages the user has sent to zero.
MAX_BULLYING_MESSAGES = 3


def add_to_db(user_id, groupchat_id):
    """Record one bullying message for this user in this group chat."""
    # engine.begin() commits the transaction when the block exits without an error,
    # so the write is saved on both SQLAlchemy 1.x and 2.x
    with engine.begin() as conn:
        #first check if the user_id and groupchat_id are in our database
        search_query = GroupMembers.select().where(db.and_(GroupMembers.columns.user_id == user_id, GroupMembers.columns.groupchat_id == groupchat_id))
        result = conn.execute(search_query).fetchone()
        if result:
            #increase no_bullying value by 1
            update_query = GroupMembers.update().where(db.and_(GroupMembers.columns.user_id == result.user_id, GroupMembers.columns.groupchat_id == result.groupchat_id)).values(no_bullying=result.no_bullying + 1)
            conn.execute(update_query)
        else:
            #add the user
            insert_query = db.insert(GroupMembers).values(user_id=user_id, groupchat_id=groupchat_id, no_bullying=1)
            conn.execute(insert_query)


def has_hit_limit(user_id, groupchat_id):
    """Check if the user has hit the cyberbullying limit for that group chat."""
    #search for user
    search_query = GroupMembers.select().where(db.and_(GroupMembers.columns.user_id == user_id, GroupMembers.columns.groupchat_id == groupchat_id))
    with engine.connect() as conn:
        result = conn.execute(search_query).fetchone()
    #a user with no record has not sent any bullying messages yet
    return result is not None and result.no_bullying >= MAX_BULLYING_MESSAGES


def reset_user_record(user_id, groupchat_id):
    """Set the user's bullying-message count for that group chat back to zero."""
    update_query = GroupMembers.update().where(db.and_(GroupMembers.columns.user_id == user_id, GroupMembers.columns.groupchat_id == groupchat_id)).values(no_bullying=0)
    with engine.begin() as conn:
        conn.execute(update_query)
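The add_to_db function reads the current counter and then writes it back in a second statement. An alternative, sketched below under the assumption of SQLAlchemy 1.4 or newer, is to let the database do the increment itself, so the new value never depends on a number read earlier in Python. The function name `record_offence` and its return value are hypothetical and not part of the original project; the table definition is repeated, on an in-memory SQLite database, so that the snippet runs on its own.

```python
import sqlalchemy as db

engine = db.create_engine("sqlite://")  # in-memory database for the demo
metadata = db.MetaData()
GroupMembers = db.Table(
    "GroupMembers",
    metadata,
    db.Column("Id", db.Integer, primary_key=True),
    db.Column("user_id", db.Integer),
    db.Column("groupchat_id", db.Integer),
    db.Column("no_bullying", db.Integer),
)
metadata.create_all(engine)


def record_offence(user_id, groupchat_id):
    """Hypothetical variant of add_to_db: add 1 to the counter and return the new value."""
    match = db.and_(
        GroupMembers.c.user_id == user_id,
        GroupMembers.c.groupchat_id == groupchat_id,
    )
    with engine.begin() as conn:  # commits when the block exits without an error
        # The increment is evaluated by the database:
        # UPDATE ... SET no_bullying = no_bullying + 1
        result = conn.execute(
            db.update(GroupMembers)
            .where(match)
            .values(no_bullying=GroupMembers.c.no_bullying + 1)
        )
        if result.rowcount == 0:
            # No row yet for this user in this group chat, so create one.
            conn.execute(
                db.insert(GroupMembers).values(
                    user_id=user_id, groupchat_id=groupchat_id, no_bullying=1
                )
            )
        # Read the counter back inside the same transaction.
        return conn.execute(
            db.select(GroupMembers.c.no_bullying).where(match)
        ).scalar_one()
```

Returning the new count would also let the caller compare it with MAX_BULLYING_MESSAGES directly instead of running a separate lookup.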
Finally, remove the earlier versions of these functions from the bot.py file and import them from the database.py file instead.
The final bot.py file:
import pickle
import os
import logging
from datetime import datetime, timedelta, timezone
from telegram import Update
from telegram.ext import filters, MessageHandler, CommandHandler, ContextTypes, Application
from model import clean_text
from database import add_to_db, has_hit_limit, reset_user_record, MAX_BULLYING_MESSAGES

#number of days a removed user stays banned
NO_BANNED_DAYS = 3
TOKEN = "YOUR BOT TOKEN"
model_path = "assets/model.pickle"
vectorizer_path = "assets/vectorizer.pickle"

logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO
)


def is_cyberbullying(text):
    """Return True if the trained model classifies the text as cyberbullying."""
    with open(vectorizer_path, 'rb') as f:
        vectorizer = pickle.load(f)
    with open(model_path, 'rb') as f:
        model = pickle.load(f)
    text = clean_text(text)
    prediction = model.predict(vectorizer.transform([text]))[0]
    return bool(prediction == 1)


#start command
async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Give a simple explanation of what the bot does"""
    await context.bot.send_message(
        chat_id=update.effective_chat.id, text=f"This is a cyberbullying bot. I will help you remove users who engage in cyberbullying in your group chats.\nHow to use?\n\nAdd me as an admin to your group chat and give me permission to remove users and to send and delete messages.\nEach user gets {MAX_BULLYING_MESSAGES} chances: once a user has sent {MAX_BULLYING_MESSAGES} messages recognized as cyberbullying, the user will be removed and banned for {NO_BANNED_DAYS} days!"
    )


#cyberbullying handler
async def remove_cyberbullying(update: Update, context: ContextTypes.DEFAULT_TYPE):
    chat_id = update.effective_chat.id
    #effective_message also covers edited messages, where update.message is None
    message = update.effective_message
    sender_id = message.from_user.id
    if is_cyberbullying(message.text):
        add_to_db(sender_id, chat_id)
        #if user has hit limit
        if has_hit_limit(sender_id, chat_id):
            #current date (timezone-aware, in UTC)
            current_date = datetime.now(timezone.utc)
            ban_duration = timedelta(days=NO_BANNED_DAYS)
            unban_date = current_date + ban_duration
            #reset record
            reset_user_record(sender_id, chat_id)
            #remove user
            await context.bot.send_message(chat_id=chat_id, text=f"The message you've sent is abusive and you've reached the cyberbullying limit, so you'll be banned from the group chat for {NO_BANNED_DAYS} days!", reply_to_message_id=message.message_id)
            await context.bot.ban_chat_member(chat_id=chat_id, user_id=sender_id, revoke_messages=False, until_date=unban_date)
        else:
            #send a message that the person has sent an abusive message
            await context.bot.send_message(chat_id=chat_id, text="The message you've sent is abusive. Be careful or you'll be removed from the group chat soon!", reply_to_message_id=message.message_id)


if __name__ == "__main__":
    application = Application.builder().token(TOKEN).build()
    start_handler = CommandHandler("start", start, filters=~filters.ChatType.GROUPS)
    #filters.TEXT & filters.ChatType.GROUPS tells us we want only text messages from group chats
    cyberbullying_handler = MessageHandler(filters.TEXT & filters.ChatType.GROUPS, remove_cyberbullying)
    application.add_handler(start_handler)
    application.add_handler(cyberbullying_handler)
    application.run_polling()
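Note that is_cyberbullying reads both pickle files from disk for every incoming message. One possible refinement, shown here only as a sketch, is to cache the loaded objects so that each file is read once per process. The helper name `load_pickle` is hypothetical, and the demo uses a throwaway pickle file in place of the real model and vectorizer.

```python
import os
import pickle
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_pickle(path):
    """Unpickle the file at `path` once; repeat calls with the same path reuse the object."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo with a throwaway pickle file standing in for assets/model.pickle.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pickle")
with open(demo_path, "wb") as f:
    pickle.dump({"label": "demo"}, f)

first = load_pickle(demo_path)
second = load_pickle(demo_path)  # served from the cache, no second disk read
```

Inside is_cyberbullying, the two pickle.load calls could then become load_pickle(vectorizer_path) and load_pickle(model_path).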
The full source code for the project can be found in my GitHub repository.
Run the bot.py file again, add the bot to a group chat and make sure you give it admin permissions. Have people send cyberbullying-related messages and confirm that the bot warns them and, once they reach the limit, removes them from the group chat.
And that's all! You've successfully created a database for your bot to store data.
Conclusion
In this tutorial, you've learned how to create a Telegram bot, build a simple machine-learning text classifier, build an SQL database with SQLAlchemy and a lot more. Many other ideas and improvements can still be applied to this bot to make it better. Some of them include:
Make the bot immediately delete messages it perceives as cyberbullying from the group.
Source more data to build a higher-performing model that can do a lot more (for example, recognize spam or 18+ rated messages).
Build a more advanced text classifier using pre-trained embeddings, pre-trained models or more complex neural networks.
Use a more advanced SQL database such as MySQL or PostgreSQL instead of the simple SQLite database used here.
Deploy your bot to a platform like Heroku or DigitalOcean.
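As a starting point for the first idea, here is a minimal sketch of deleting a flagged message. It relies on the `delete()` shortcut that python-telegram-bot message objects provide, and the bot needs the "delete messages" admin permission for it to succeed. The helper name `delete_if_bullying`, the `classify` parameter (standing in for is_cyberbullying) and the `FakeMessage` class are all hypothetical; the fake message only exists so that the snippet runs without a bot token.

```python
import asyncio


async def delete_if_bullying(message, classify):
    """Delete `message` when `classify(message.text)` is true; return True if it was deleted.

    `message` is expected to behave like telegram.Message (a `text` attribute
    and an awaitable `delete()` method); `classify` stands in for is_cyberbullying.
    """
    if message.text and classify(message.text):
        await message.delete()
        return True
    return False


class FakeMessage:
    """Stand-in for telegram.Message so the sketch runs without a bot token."""

    def __init__(self, text):
        self.text = text
        self.deleted = False

    async def delete(self):
        self.deleted = True


# Toy classifier for the demo: flags any text containing the word "worthless".
def toy_classifier(text):
    return "worthless" in text


flagged = FakeMessage("you are worthless")
harmless = FakeMessage("good morning everyone")
flagged_result = asyncio.run(delete_if_bullying(flagged, toy_classifier))
harmless_result = asyncio.run(delete_if_bullying(harmless, toy_classifier))
```

Because the warning in remove_cyberbullying is sent as a reply to the offending message, one option is to send the warning first and delete the message afterwards.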
Many more interesting projects can be built on the Telegram API, and you can learn more from the official API documentation.
Thanks for reading!