Building a Telegram Bot to Detect Cyberbullying: A Python Machine Learning Tutorial

Learn how to build a machine learning model that detects cyberbullying in group chats and how to deploy it as a Telegram bot.

Bots (short for robots) are software applications or scripts that perform automated tasks. These tasks can range from simple, repetitive actions to complex, intelligent interactions. Bots can be designed to operate on various platforms, including websites, messaging apps, social media platforms, and more.

In this article, you'll build a bot on a social platform (Telegram) to manage cyberbullying. Telegram is a messaging and social media platform that lets users send messages, interact in groups and channels, make voice and video calls, and a lot more. The bot will have access to messages in groups and will be able to remove group members who engage in cyberbullying. It is meant to moderate conversations and help group admins manage their groups. An example of the bot can be found here.

Prerequisites

To follow along with this tutorial, you need the following:

  • Good knowledge of Python programming.

  • Basic understanding of Machine Learning algorithms.

  • Python installed on your computer.

  • A text editor - Visual Studio Code preferably.

Libraries Used

In this tutorial, you'll use several Python libraries and packages, the major ones being python-telegram-bot, scikit-learn, SQLAlchemy, pandas, numpy and nltk. python-telegram-bot is a Python wrapper for the Telegram API that makes it easier to write Python code that communicates with the API. SQLAlchemy will be used to create a simple database that tracks the members of group chats and the number of cyberbullying-related messages they've sent. pandas, numpy, nltk and scikit-learn will be used to build a machine learning model that predicts whether a message is cyberbullying-related or not.

Install packages

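To install the packages, run the command below in your terminal. The bot code in this tutorial uses the Application class, so you need python-telegram-bot version 20 or later: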

pip install nltk pandas numpy sqlalchemy scikit-learn python-telegram-bot

Building the telegram bot

The python-telegram-bot package has several high-level classes that make bot development easy. It has two major submodules: the pure Python telegram module (which you can also use to fetch updates and send messages) and the telegram.ext module, which has a lot of built-in objects and classes that take work off your shoulders.

To build the telegram bot, the steps you'll follow are listed below:

Create the bot and get the API Token

First, you have to create a bot with BotFather. BotFather is the father of all bots on Telegram; with it, you can create and manage your Telegram bots and tokens.

You can reach BotFather by searching for "BotFather" on Telegram.

When you click "BotFather", a chat opens where you can communicate with the bot. Click on Start.

Next, type "/newbot" to create a new bot. BotFather will ask you for a name and a unique username for your bot.

You'll then receive a congratulatory message confirming that you have successfully created a Telegram bot, with a link to access your bot and the bot token. The token is like a password that you use when telling your bot to perform an instruction. It is important to keep your token secret, as anyone who has access to it can control your bot.
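Because the token must stay secret, one option is to keep it out of the source file altogether. The sketch below reads it from an environment variable; the variable name TELEGRAM_BOT_TOKEN and the helper load_bot_token are assumptions made for this illustration, not something the tutorial or the library requires.

```python
import os


def load_bot_token(env_var="TELEGRAM_BOT_TOKEN"):
    """Return the bot token stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token


if __name__ == "__main__":
    # Placeholder value for demonstration only; set the real token in your shell.
    os.environ.setdefault("TELEGRAM_BOT_TOKEN", "123456:placeholder-token")
    TOKEN = load_bot_token()
    print("Token loaded, length:", len(TOKEN))
```

If you use this approach, TOKEN = load_bot_token() would take the place of the hard-coded string used later in bot.py.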

Building with python-telegram-bot

To work with the python-telegram-bot package, there are some basic classes you should be familiar with:

The Updater class fetches updates on messages from Telegram and stores them in a queue called the update_queue. The Application class is responsible for fetching updates from this queue. You can then create functions, also called handlers, to handle updates of different types and add the handlers to your Application.

There are many types of handlers, but the ones used in this tutorial are command handlers and message handlers. Command handlers handle commands in Telegram; commands are messages that begin with "/". For example, the "/newbot" you sent to BotFather was a command. Message handlers handle messages.

The handler functions accept two parameters, update and context. The update is an object that contains all the information and data coming from Telegram, and the context is another object that contains information and data about the status of the library itself (like the Bot, the Application etc.).

Another important concept is filters. As the name implies, filters restrict the kind of updates that a handler can handle.
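To make the relationship between updates, filters and handlers concrete, here is a toy model written in plain Python. It is only a simplified sketch of the idea: ToyUpdate, ToyApplication, is_command and is_group_text are hypothetical names invented for this illustration and are not part of python-telegram-bot.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ToyUpdate:
    """Stand-in for an incoming message (not the library's Update class)."""
    text: str
    chat_type: str  # "private" or "group"


# A filter is modelled as a function that decides whether a handler should run.
Filter = Callable[[ToyUpdate], bool]
Handler = Callable[[ToyUpdate], str]


def is_command(name: str) -> Filter:
    """Accept updates whose first word is the given /command."""
    return lambda update: update.text.split()[:1] == [f"/{name}"]


def is_group_text(update: ToyUpdate) -> bool:
    """Accept non-command text sent in a group chat."""
    return update.chat_type == "group" and not update.text.startswith("/")


class ToyApplication:
    def __init__(self) -> None:
        self.handlers: List[Tuple[Filter, Handler]] = []

    def add_handler(self, update_filter: Filter, handler: Handler) -> None:
        self.handlers.append((update_filter, handler))

    def process(self, update: ToyUpdate) -> Optional[str]:
        # The first handler whose filter accepts the update handles it.
        for update_filter, handler in self.handlers:
            if update_filter(update):
                return handler(update)
        return None


app = ToyApplication()
app.add_handler(is_command("start"), lambda u: "welcome")
app.add_handler(is_group_text, lambda u: f"checking: {u.text}")

print(app.process(ToyUpdate("/start", "private")))   # welcome
print(app.process(ToyUpdate("hello all", "group")))  # checking: hello all
print(app.process(ToyUpdate("hello", "private")))    # None
```

The real library works asynchronously and passes a context object as well, but the flow is the same in spirit: an update arrives, filters decide which handler is responsible, and that handler runs.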

With these concepts out of the way, you can go ahead and write the code for your bot.

  • Create a folder for your project and a Python file with the name bot.py.

  • In the bot.py file, import the necessary packages and create a string for the bot token.

#import the necessary packages
from telegram import Update
from telegram.ext import Application, CommandHandler, ContextTypes, MessageHandler, filters

TOKEN="YOUR BOT TOKEN"
  • Next, add a logging system.

    Logging is a crucial tool for debugging, monitoring, and maintaining applications.

#logging
import logging
logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO
)
  • Create the start handler function

    Create a function to handle the "/start" command. When a user opens the bot and types the "/start" command, you'll want the bot to tell the user basic information about your bot and what it does.

#start command
async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Give a simple explanation of what the bot does"""
    await context.bot.send_message(chat_id=update.effective_chat.id, text="This is a cyber bullying bot, I will help you remove users who engage in cyber-bullying on your group chats")
  • Create an application and add the start handler to your application.

    Create an application, a command handler for the start function, add the handler to your application and run the application using the .run_polling() function. The filters=~filters.ChatType.GROUPS parameter specifies that you don't want this handler to handle updates from Group chats.

if __name__ == "__main__":
    application = Application.builder().token(TOKEN).build()
    start_handler = CommandHandler("start", start, filters=~filters.ChatType.GROUPS)
    application.add_handler(start_handler)
    application.run_polling()
  • Register the start command with BotFather.

    Telegram doesn't know yet that your bot accepts the "/start" command, so it's important (although not compulsory) to register that command with BotFather. Registering a command with BotFather creates a menu that shows the list of commands that your bot recognizes. You can do that with the "/setcommands" command.

  • Test your bot by running the Python file and confirm that the start command works.

Add the cyber-bullying messages handler

The cyberbullying handler will accept messages sent in group chats and check whether each message is cyberbullying-related. If it is, the handler adds the user_id of the sender and the groupchat_id of the group to a database. There is also a no_bullying field that counts the number of cyberbullying messages the user has sent to the group chat. When no_bullying reaches a limit (3 for this project), the user is banned from the group for some days (3 days for this project).

  • In the bot.py file, create a simple function to identify cyberbullying.

    Create a simple function that tells whether a text is cyberbullying-related or not. The function works by checking if "fool" is in the text; if so, the text is treated as cyberbullying-related.

    Note: This is a very inaccurate way of telling whether a text is cyberbullying-related or not. You'll build a more accurate one using machine learning techniques later on in the tutorial.

def is_cyberbullying(text):
    if "fool" in text:
        return True
    return False
  • Create a temporary database.

    After creating the is_cyberbullying function, you'll want to create a database to store information on the user_id, groupchat_id and the number of bullying messages sent by a user in a group. A list of dictionaries can be used as a temporary database (which will be cleared when you stop running the Python file) for now to store the information. Later on in the tutorial, you'll learn how to create a real database.

#a list of dictionaries in the format {"user_id": 1, "chat_id": 1, "no_bullying": 0}
db=[]
  • Create helper functions to communicate with the temporary database.

    The functions are:

    • A function to add a user_id and groupchat_id to the database anytime we spot a cyberbullying-related message.

    • A function to check if a user has hit the number of cyber-bullying messages limit for a group chat.

    • A function to reset the number of cyber-bullying messages a user has sent to zero.

The functions are shown below:

#maximum number of bullying_messages
MAX_BULLYING_MESSAGES=3
NO_BANNED_DAYS=3


def add_to_db(user_id,chat_id):
    #search for the user record in the group chat 
    for i,user_record in enumerate(db):
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            #if user exists, increase the number of bullying message the user has sent by 1
            user_record["no_bullying"]+=1
            db[i]=user_record
            return
    #if user doesn't exist, create a new user to the database 
    db.append({"user_id":user_id,"chat_id":chat_id,"no_bullying":1})
    return 

def has_hit_limit(user_id,chat_id):
    for user_record in db:
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            if user_record["no_bullying"]==MAX_BULLYING_MESSAGES:
                return True
            return False

def reset_user_record(user_id,chat_id):
    for i,user_record in enumerate(db):
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            user_record["no_bullying"]=0
            db[i]=user_record
            return
  • Build the cyberbullying handler function.

    The cyberbullying handler function checks whether a message sent to a group chat is cyberbullying-related. If it is, the function adds the user_id and the groupchat_id to the database and sends a warning to the user. If the user has hit MAX_BULLYING_MESSAGES, the function bans that user from the group chat for 3 days (NO_BANNED_DAYS=3) and resets the number of cyberbullying messages the user has sent to zero.

from datetime import datetime, timedelta

#cyberbullying handler
async def remove_cyberbullying(update: Update, context: ContextTypes.DEFAULT_TYPE):
   chat_id=update.effective_chat.id
   #effective_message also covers edited messages, for which update.message is None
   message=update.effective_message
   sender_id=message.from_user.id
   if is_cyberbullying(message.text):
       add_to_db(sender_id,chat_id)    
       #if user has hit limit
       if has_hit_limit(sender_id,chat_id):
           #current date
           current_date=datetime.now()
           ban_duration=timedelta(days=NO_BANNED_DAYS)
           unban_date=current_date+ban_duration

           #reset record
           reset_user_record(sender_id,chat_id)

           #remove user 
           await context.bot.send_message(chat_id=update.effective_chat.id, text=f"The message you've sent is abusive, and you've exceeded the cyberbullying limit, you'll be banned from the group chat for {NO_BANNED_DAYS} days!!",reply_to_message_id=message.message_id)
           await context.bot.ban_chat_member(chat_id=chat_id,user_id=sender_id, revoke_messages=False, until_date=unban_date)     
       else:
           #send a message that the person has sent an abusive message 
           await context.bot.send_message(chat_id=update.effective_chat.id, text="The message you've sent is abusive, be careful or you'll be removed from the group chat soon!",reply_to_message_id=message.message_id)
  • Finally, create a cyberbullying handler, then add it to the application and run the Python file again to test the bot.
if __name__ == "__main__":
    application = Application.builder().token(TOKEN).build()
    start_handler = CommandHandler("start", start, filters=~filters.ChatType.GROUPS)
    cyberbullying_handler = MessageHandler(filters.TEXT & filters.ChatType.GROUPS , remove_cyberbullying) #filters.TEXT & filters.ChatType.GROUPS tells us we want only text messages from group chats 
    application.add_handler(start_handler)
    application.add_handler(cyberbullying_handler)
    application.run_polling()
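The handler above builds the unban date from datetime.now(), which returns a naive datetime with no timezone attached, so how it is interpreted depends on the library's defaults. A minimal sketch of a timezone-aware alternative is shown below; compute_unban_date is a hypothetical helper introduced for this illustration, and its result could be passed as until_date in the same way the handler passes unban_date.

```python
from datetime import datetime, timedelta, timezone

NO_BANNED_DAYS = 3


def compute_unban_date(days=NO_BANNED_DAYS, now=None):
    """Return a timezone-aware UTC datetime `days` days after `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now + timedelta(days=days)


# Fixed starting point so the result is easy to check by hand.
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(compute_unban_date(now=start).isoformat())  # 2024-01-04T12:00:00+00:00
```

For a ban measured in days the difference is small, but an explicit UTC value removes any ambiguity about which clock the date refers to.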

The full bot.py file (with the start function slightly updated) becomes:

#import the necessary packages
import logging
from datetime import datetime, timedelta
from telegram import Update
from telegram.ext import Application, CommandHandler, ContextTypes, MessageHandler, filters

TOKEN="YOUR BOT TOKEN"
MAX_BULLYING_MESSAGES=3
NO_BANNED_DAYS=3

#a list of dictionaries in the format {"user_id": 1, "chat_id": 1, "no_bullying": 0}
db=[]

logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO
)


def is_cyberbullying(text):
    if "fool" in text:
        return True
    return False

def add_to_db(user_id,chat_id):
    #search for the user record in the group chat 
    for i,user_record in enumerate(db):
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            #if user exists, increase the number of bullying message the user has sent by 1
            user_record["no_bullying"]+=1
            db[i]=user_record
            return
    #if user doesn't exist, create a new user to the database 
    db.append({"user_id":user_id,"chat_id":chat_id,"no_bullying":1})
    return 

def has_hit_limit(user_id,chat_id):
    for user_record in db:
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            if user_record["no_bullying"]==MAX_BULLYING_MESSAGES:
                return True
            return False

def reset_user_record(user_id,chat_id):
    for i,user_record in enumerate(db):
        if (user_record["user_id"]==user_id) and (user_record["chat_id"]==chat_id):
            user_record["no_bullying"]=0
            db[i]=user_record
            return

#start command
async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Give a simple explanation of what the bot does"""
    await context.bot.send_message(
        chat_id=update.effective_chat.id, text=f"This is a cyber bullying bot, I will help you remove users who engage in cyber-bullying on your group chats\nHow to use?\n\nAdd me as an admin to your group chat and give me permission to remove users, send and delete messages.  \nUsers are given {MAX_BULLYING_MESSAGES} opportunities, if a user has sent up to {MAX_BULLYING_MESSAGES} messages recognized as cyberbullying, the user will be removed and banned for {NO_BANNED_DAYS} days!"
    )

#cyberbullying handler
async def remove_cyberbullying(update: Update, context: ContextTypes.DEFAULT_TYPE):
   chat_id=update.effective_chat.id
   #effective_message also covers edited messages, for which update.message is None
   message=update.effective_message
   sender_id=message.from_user.id
   if is_cyberbullying(message.text):
       add_to_db(sender_id,chat_id)

       #if user has hit limit
       if has_hit_limit(sender_id,chat_id):
           #current date
           current_date=datetime.now()
           ban_duration=timedelta(days=NO_BANNED_DAYS)
           unban_date=current_date+ban_duration

           #reset record
           reset_user_record(sender_id,chat_id)

           #remove user 
           await context.bot.send_message(chat_id=update.effective_chat.id, text=f"The message you've sent is abusive, and you've exceeded the cyberbullying limit, you'll be banned from the group chat for {NO_BANNED_DAYS} days!!",reply_to_message_id=message.message_id)

           await context.bot.ban_chat_member(chat_id=chat_id,user_id=sender_id, revoke_messages=False, until_date=unban_date)

       else:
           #send a message that the person has sent an abusive message 
           await context.bot.send_message(chat_id=update.effective_chat.id, text="The message you've sent is abusive, be careful or you'll be removed from the group chat soon!",reply_to_message_id=message.message_id)

if __name__ == "__main__":
    application = Application.builder().token(TOKEN).build()
    start_handler = CommandHandler("start", start, filters=~filters.ChatType.GROUPS)
    cyberbullying_handler = MessageHandler(filters.TEXT & filters.ChatType.GROUPS , remove_cyberbullying) #filters.TEXT & filters.ChatType.GROUPS tells us we want only text messages from group chats 
    application.add_handler(start_handler)
    application.add_handler(cyberbullying_handler)
    application.run_polling()

The telegram bot logic is complete now and you can test it by adding it to a group, ensuring you make the bot an admin and give it all admin permissions.

Note: The bot cannot remove the owner of a group chat!
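The list-based temporary database scans every record on each lookup. As an alternative sketch, the same bookkeeping can be kept in a dictionary keyed by (user_id, chat_id). The function record_strike and the strikes dictionary are hypothetical names for this illustration; they combine the add, check and reset steps from the helper functions above into one call.

```python
from collections import defaultdict

MAX_BULLYING_MESSAGES = 3

# Maps (user_id, chat_id) -> number of flagged messages.
strikes = defaultdict(int)


def record_strike(user_id, chat_id):
    """Count one flagged message; return True when the limit is reached."""
    key = (user_id, chat_id)
    strikes[key] += 1
    if strikes[key] >= MAX_BULLYING_MESSAGES:
        strikes[key] = 0  # reset so the count starts over after the ban
        return True
    return False


# Four flagged messages from the same user in the same group.
results = [record_strike(42, -100) for _ in range(4)]
print(results)  # [False, False, True, False]
```

Like the list, this dictionary lives in memory only and is emptied when the Python file stops running.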

Build a machine-learning model to identify cyberbullying messages

In this section, you'll replace the is_cyberbullying() function with a machine-learning text classifier, which will identify cyberbullying messages better than the simple function used in the previous section. So let's dive in.

Data

To build a machine learning model that identifies whether a text is cyberbullying or not, you'll need data on cyberbullying and non-cyberbullying texts. Kaggle is a good platform where you can find data for many kinds of projects. Two datasets on Kaggle were found to be useful for this project: the kaggle_parsed_dataset.csv from this Cyberbullying Dataset page and a JSON file from this Tweets Dataset for Detection of Cyber-Trolls page on Kaggle.

  • Download both datasets, create a new folder "data" in your project folder and save both datasets in the folder.

  • Create a new Python file "model.py". This Python file will be used to clean the data and build the model.

  • Read the datasets.

    In the "model.py" file, load the datasets using pandas .read_csv() and .read_json() functions.

import pandas as pd
#read the dataset
dataset_1=pd.read_json("data/Dataset for Detection of Cyber-Trolls.json",lines=True)
dataset_2=pd.read_csv("data/kaggle_parsed_dataset.csv")
  • Clean the first dataset (dataset_1).

    Rename the columns and drop the unused column so that only the text and label remain. Also, extract the label from the label column.

#CLEAN DATASET 1
#rename columns 
dataset_1.columns=["text","label","extras"]
#drop unused column
dataset_1=dataset_1.drop("extras",axis=1)

def extract_label(label):
  """ Original label is in the format {'notes': '', 'label': ['1']}"""
  return int(label["label"][0])
dataset_1["label"]=dataset_1["label"].apply(extract_label)
  • Clean the second dataset (dataset_2).

    Rename the columns and remove the starting double quotes from the text column.

#CLEAN DATASET 2
required_cols=["Text","oh_label"]
dataset_2=dataset_2[required_cols]
dataset_2.columns=["text","label"]

def remove_quotes(text):
  """Every string in the CSV file starts with a double quote; this slice drops the first character and the last two characters."""
  return text[1:len(text)-2]

dataset_2["text"]=dataset_2["text"].apply(remove_quotes)
  • Concatenate the datasets.

    Combine dataset_1 and dataset_2 to form a single dataset.

#CONCATENATE THE DATASETS
all_data=pd.concat([dataset_1,dataset_2])
  • Write a function that takes in a text and removes all usernames, URLs, punctuation, numbers and stop words from the text.
import re
import nltk
import string
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize

#functions to clean the text
def clean_text(text):
  """Clean the text"""
  # Lowercase the text
  text = text.lower()
  # Remove emails & twitter usernames
  text = re.sub(r'\S*@\S*', '', text)
  # Remove urls (\S* matches any run of non-whitespace chars)
  text = re.sub(r'http\S*', '', text)
  # Replace everything that is not a letter (numbers, punctuation) with a space
  text = re.sub('[^a-zA-Z]',' ',text)

  for punctuation in string.punctuation:
    text=text.replace(punctuation,"")

  # Split the text into word tokens
  word_tokens = word_tokenize(text)

  #remove all stop words 
  stopwords=nltk.corpus.stopwords.words("english")
  new_word_tokens=[]
  for token in word_tokens:
    if token not in stopwords:
      new_word_tokens.append(token)

  # Join the remaining tokens with single spaces (after the loop has finished)
  return ' '.join(new_word_tokens)
all_data["text"]=all_data["text"].apply(clean_text)
  • Remove all missing values and duplicate rows.
#remove all_duplicates
all_data=all_data.drop_duplicates(subset="text")
#remove missing values
all_data=all_data.dropna(subset=["text"])
  • Separate the label and text columns.
#separate target and text columns 
target=all_data["label"]
text=all_data["text"]
  • Split the dataset into a training set (X) to train the model on and a validation set (val) to test the model on.
#Import necessary Sklearn functions and classes
from sklearn.model_selection import train_test_split

X,val,y,y_val=train_test_split(text,target,test_size=0.15,random_state=0)
  • You can write all the code to load and clean the data into a function called clean_data() to make your code neater.
def clean_data():
    #load the dataset
    dataset_1=pd.read_json("data/Dataset for Detection of Cyber-Trolls.json",lines=True)
    dataset_2=pd.read_csv("data/kaggle_parsed_dataset.csv")

    #CLEAN DATASET 1
    #rename columns 
    dataset_1.columns=["text","label","extras"]
    #drop unused column
    dataset_1=dataset_1.drop("extras",axis=1)
    dataset_1["label"]=dataset_1["label"].apply(extract_label)

    #CLEAN DATASET 2
    required_cols=["Text","oh_label"]
    dataset_2=dataset_2[required_cols]
    dataset_2.columns=["text","label"]
    dataset_2["text"]=dataset_2["text"].apply(remove_quotes)

    #CONCATENATE THE DATASETS
    all_data=pd.concat([dataset_1,dataset_2])

    all_data["text"]=all_data["text"].apply(clean_text)
    #remove all_duplicates
    all_data=all_data.drop_duplicates(subset="text")
    #remove missing values
    all_data=all_data.dropna(subset=["text"])

    #separate target and text columns 
    target=all_data["label"]
    text=all_data["text"]
    X,val,y,y_val=train_test_split(text,target,test_size=0.15,random_state=0)
    return X, val, y, y_val
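To see what the text-cleaning steps do to a single message, here is a small standalone sketch. It mirrors the steps of clean_text but, as an assumption to keep it dependency-free, it splits on whitespace and uses a tiny hand-written stop-word set (SAMPLE_STOPWORDS) instead of NLTK's tokenizer and English stop-word list.

```python
import re

# Tiny illustrative stop-word set; the tutorial uses NLTK's English list instead.
SAMPLE_STOPWORDS = {"a", "an", "the", "is", "are", "you", "to", "at", "and"}


def clean_text_sketch(text):
    """Apply the same kind of cleaning as clean_text, using only the standard library."""
    text = text.lower()
    text = re.sub(r"\S*@\S*", "", text)   # drop e-mail addresses and @usernames
    text = re.sub(r"http\S*", "", text)   # drop URLs
    text = re.sub(r"[^a-z]", " ", text)   # replace digits and punctuation with spaces
    tokens = text.split()                 # split on whitespace
    return " ".join(t for t in tokens if t not in SAMPLE_STOPWORDS)


print(clean_text_sketch("@sam You are a FOOL!!! see http://example.com 123"))  # fool see
```

The username, URL, digits, punctuation and stop words are all gone, leaving only the words that carry meaning for the classifier.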

Model

Generally, machine learning models accept numerical data, so you need to convert the text data to numerical values first.

There are many methods of converting text to numerical values; popular ones include embeddings, count vectorization and TF-IDF.
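As a rough illustration of count vectorization, the sketch below builds a vocabulary from a few documents and turns a text into a vector of word counts. It is a simplified stand-in written with the standard library; build_vocabulary and count_vectorize are hypothetical helper names, and scikit-learn's CountVectorizer does more than this.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def count_vectorize(document, vocabulary):
    """Return one count per vocabulary word; unknown words are ignored."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector


docs = ["you fool fool", "nice game today"]
vocab = build_vocabulary(docs)
print(vocab)  # {'fool': 0, 'game': 1, 'nice': 2, 'today': 3, 'you': 4}
print(count_vectorize("fool you fool", vocab))      # [2, 0, 0, 0, 1]
print(count_vectorize("nice fool unknown", vocab))  # [1, 0, 1, 0, 0]
```

Each position in the vector corresponds to one vocabulary word, and a word that was never seen while building the vocabulary contributes nothing. The settings used in this tutorial, min_df=2 and ngram_range=(1,2), additionally drop terms that appear in fewer than two documents and count pairs of adjacent words.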

After converting the text to numerical values (vectors), you'll then build a classifier on the numerical vectors. In this tutorial count vectorization and a linear classifier (logistic regression) will be used.

Add the below code in the "model.py" file.

from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

X, val, y, y_val=clean_data()
vectorizer=CountVectorizer(min_df=2,ngram_range=(1,2))
X_vect=vectorizer.fit_transform(X)
val_vect=vectorizer.transform(val)
print(f"We have {len(vectorizer.vocabulary_)} words in the vocabulary")
#create a logistic regression model
model=LogisticRegression(max_iter=200)

#fit the model
print("FITTING THE MODEL!!! ")
model.fit(X_vect,y)

After building the model, you'll need to evaluate its performance on the validation dataset using classification metrics like f1_score, recall, and precision.

print("Model training complete\n EVALUATING THE MODELS PERFORMANCE ")
predictions=model.predict(val_vect)
f1_score_=f1_score(y_val,predictions)
recall_score_=recall_score(y_val,predictions)
precision_score_=precision_score(y_val,predictions)
print(f"F1 score : {f1_score_}")
print(f"Recall score : {recall_score_}")
print(f"Precision Score: {precision_score_}")
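To show what these metrics measure, the sketch below computes precision, recall and F1 by hand for a small, made-up set of labels (1 for cyberbullying, 0 otherwise). The function binary_metrics and the example labels are illustrative assumptions, not results from the tutorial's model.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Here the model flags four messages and three of them are truly abusive (precision 0.75), and it catches three of the four abusive messages (recall 0.75). For a moderation bot, low precision means innocent users get warned, while low recall means abusive messages slip through.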

After training the model, you'll have to save the vectorizer and model objects to files so you can use them when making predictions. The pickle package is used for serializing and deserializing Python objects, so pickling the vectorizer and the model keeps them safe and ready to use whenever you need them.

Create a new folder "assets" where you'd save the vectorizer and model files. Add the below code in the "model.py" file to save the model and vectorizer.

import pickle
#save the vectorizer and model
model_path="assets/model.pickle" 
vectorizer_path="assets/vectorizer.pickle"
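#use "with" so each file is closed as soon as the object has been written
with open(model_path, "wb") as model_file:
    pickle.dump(model, model_file)
with open(vectorizer_path, "wb") as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)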

Run your code and you'll notice that two new files, "model.pickle" and "vectorizer.pickle", have been added to your assets folder.
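If you want to check the save-and-load round trip without training a model first, the following sketch pickles an ordinary dictionary to a temporary folder and reads it back. The helpers save_object and load_object and the demo data are assumptions for this illustration; the same pattern would apply to the model and vectorizer objects.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Pickle obj to path, creating the parent folder if it is missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as file:
        pickle.dump(obj, file)


def load_object(path):
    """Read a pickled object back from path."""
    with Path(path).open("rb") as file:
        return pickle.load(file)


# Round trip through a throwaway folder that stands in for "assets".
with tempfile.TemporaryDirectory() as folder:
    demo_path = Path(folder) / "assets" / "demo.pickle"
    save_object({"vocabulary": {"fool": 0}, "threshold": 0.5}, demo_path)
    restored = load_object(demo_path)

print(restored)  # {'vocabulary': {'fool': 0}, 'threshold': 0.5}
```

Only load pickle files that you created yourself or otherwise trust, since unpickling can execute arbitrary code.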

To make your code neater, write all the necessary lines to build the model into a function. The updated code then becomes:

import re
import nltk
import string
import pickle
import pandas as pd
import numpy as np
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize



def extract_label(label):
  """ Original label is in the format {'notes': '', 'label': ['1']}"""
  return int(label["label"][0])

def remove_quotes(text):
  """Every string in the CSV file starts with a double quote; this slice drops the first character and the last two characters."""
  return text[1:len(text)-2]



def clean_text(text):
  """Clean the text"""
  # Lowercase the text
  text = text.lower()

  # Remove emails & twitter usernames
  text = re.sub(r'\S*@\S*', '', text)

  # Remove urls (\S* matches any run of non-whitespace chars)
  text = re.sub(r'http\S*', '', text)

  # Replace everything that is not a letter (numbers, punctuation) with a space
  text = re.sub('[^a-zA-Z]',' ',text)


  for punctuation in string.punctuation:
    text=text.replace(punctuation,"")

  # Split the text into word tokens
  word_tokens = word_tokenize(text)
  stopwords=nltk.corpus.stopwords.words("english")

  new_word_tokens=[]
  for token in word_tokens:
    if token not in stopwords:
      new_word_tokens.append(token)

  # Join the remaining tokens with single spaces (after the loop has finished)
  return ' '.join(new_word_tokens)

#Import necessary Sklearn functions and classes
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def clean_data():
    #load the dataset
    dataset_1=pd.read_json("data/Dataset for Detection of Cyber-Trolls.json",lines=True)
    dataset_2=pd.read_csv("data/kaggle_parsed_dataset.csv")

    #CLEAN DATASET 1
    #rename columns 
    dataset_1.columns=["text","label","extras"]
    #drop unused column
    dataset_1=dataset_1.drop("extras",axis=1)
    dataset_1["label"]=dataset_1["label"].apply(extract_label)

    #CLEAN DATASET 2
    required_cols=["Text","oh_label"]
    dataset_2=dataset_2[required_cols]
    dataset_2.columns=["text","label"]
    dataset_2["text"]=dataset_2["text"].apply(remove_quotes)

    #CONCATENATE THE DATASETS
    all_data=pd.concat([dataset_1,dataset_2])

    all_data["text"]=all_data["text"].apply(clean_text)
    #remove all_duplicates
    all_data=all_data.drop_duplicates(subset="text")
    #remove missing values
    all_data=all_data.dropna(subset=["text"])

    #separate target and text columns 
    target=all_data["label"]
    text=all_data["text"]
    X,val,y,y_val=train_test_split(text,target,test_size=0.15,random_state=0)
    return X, val, y, y_val

def train_model(X,val,y,y_val):
    #create a bag of words model
    vectorizer=CountVectorizer(min_df=2,ngram_range=(1,2))
    X_vect=vectorizer.fit_transform(X)
    val_vect=vectorizer.transform(val)
    print(f"We have {len(vectorizer.vocabulary_)} words in the vocabulary")

    #create a logistic regression model
    model=LogisticRegression(max_iter=200)

    #fit the model
    print("FITTING THE MODEL!!! ")
    model.fit(X_vect,y)

    print("Model training complete\n EVALUATING THE MODELS PERFORMANCE ")
    predictions=model.predict(val_vect)
    f1_score_=f1_score(y_val,predictions)
    recall_score_=recall_score(y_val,predictions)
    precision_score_=precision_score(y_val,predictions)
    print(f"F1 score : {f1_score_}")
    print(f"Recall score : {recall_score_}")
    print(f"Precision Score: {precision_score_}")
    return model, vectorizer

if __name__ == "__main__":
    #clean the data and train the model only when this file is run directly.
    X, val, y, y_val=clean_data()
    model,vectorizer=train_model(X,val,y,y_val)
    model_path="assets/model.pickle" 
    vectorizer_path="assets/vectorizer.pickle"
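    #use "with" so each file is closed as soon as the object has been written
    with open(model_path, "wb") as model_file:
        pickle.dump(model, model_file)
    with open(vectorizer_path, "wb") as vectorizer_file:
        pickle.dump(vectorizer, vectorizer_file)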

Update the is_cyberbullying function

Since you've built a machine-learning model to identify cyberbullying-related text, update the is_cyberbullying() function to use the machine-learning model you've built.

In the bot.py file, update the is_cyberbullying function.

#import the clean_text function from the model.py file
#(you'll want to apply the same data cleaning you
#did on the training data to the messages).
import pickle 
from model import clean_text


model_path="assets/model.pickle" 
vectorizer_path="assets/vectorizer.pickle"

def is_cyberbullying(text):
    #read the saved vectorizer and model, closing each file after loading
    with open(vectorizer_path,'rb') as vectorizer_file:
        vectorizer = pickle.load(vectorizer_file)
    with open(model_path,'rb') as model_file:
        model = pickle.load(model_file)
    #clean the text
    text=clean_text(text)
    #convert the text to a vector and make a prediction
    prediction=model.predict(vectorizer.transform([text]))[0]
    #a label of 1 means the text is cyberbullying-related
    return bool(prediction==1)
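Reading the pickle files inside is_cyberbullying means they are loaded from disk for every message. One possible refinement, sketched below, is to cache the loaded objects with functools.lru_cache so each file is read only once. The function load_pickle and the throwaway demo file are assumptions for this illustration; in bot.py the paths would be model_path and vectorizer_path.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_pickle(path):
    """Read a pickled object from disk; repeated calls reuse the cached result."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Demonstration with a throwaway file standing in for assets/model.pickle.
demo_file = Path(tempfile.mkdtemp()) / "demo.pickle"
demo_file.write_bytes(pickle.dumps({"name": "stand-in model"}))

first = load_pickle(str(demo_file))
second = load_pickle(str(demo_file))
print(first is second)                 # True
print(load_pickle.cache_info().hits)   # 1
```

The second call returns the very same object without touching the file again. The trade-off is that a retrained model is only picked up after the bot restarts.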

Add a real database

The previous database you used was a simple Python list, which is temporary and becomes empty as soon as you stop running the Python file. In this section, you'll learn how to add a real database using SQLAlchemy.

SQL (Structured Query Language) is a language used for managing and manipulating relational databases. It allows users to interact with databases to perform operations like querying data, inserting, updating, and deleting records, and creating or modifying database structures. SQLAlchemy is a Python library for managing and interacting with SQL databases using Python code.

How to use SQLAlchemy

Create a test_sql.py file in your project folder. You'll use this file to create a simple SQL database and learn how to perform CRUD (create, read, update and delete) operations with SQLAlchemy. Follow the steps below in the test_sql.py file:

  • Importing SQLAlchemy
import sqlalchemy as db
  • Create the Database and the GroupMembers Table (the table where you'll store the data).
engine = db.create_engine('sqlite:///test_database.sqlite')
#create a connection 
conn = engine.connect()

#metadata
metadata = db.MetaData()

GroupMembers = db.Table('GroupMembers', metadata,
                        db.Column('Id', db.Integer(), primary_key=True),
                        db.Column('user_id', db.Integer),
                        db.Column('groupchat_id', db.Integer),
                        db.Column('no_bullying', db.Integer),
                        )
metadata.create_all(engine)
  • Inserting Data (Create).
query = db.insert(GroupMembers).values(user_id=4427, groupchat_id=125,no_bullying=1)
Result = conn.execute(query)

query = db.insert(GroupMembers).values(user_id=2, groupchat_id=12,no_bullying=1)
Result = conn.execute(query)
#on SQLAlchemy 2.x, commit so the inserted rows are saved to the database file
conn.commit()
  • Reading data
#read all items in the table
output = conn.execute(GroupMembers.select()).fetchall()
print(output)

#search for specific items 
query = GroupMembers.select().where(GroupMembers.columns.user_id==4427)
output = conn.execute(query)
print(output.fetchone())

#search for items with multiple conditions
query = GroupMembers.select().where(db.and_(GroupMembers.columns.user_id == 2, GroupMembers.columns.groupchat_id == 12))
output = conn.execute(query)
result=output.fetchone()
print(result)
  • Updating data
# Get previous value of no_bullying
no_bullying = result.no_bullying

query = GroupMembers.update().where(
    db.and_(GroupMembers.columns.user_id == result.user_id, GroupMembers.columns.groupchat_id == result.groupchat_id)
).values(no_bullying=no_bullying + 1)
conn.execute(query)
#save the update (SQLAlchemy 2.0 connections do not autocommit)
conn.commit()
  • Deleting data
query = GroupMembers.delete().where(
    db.and_(GroupMembers.columns.user_id == 4427, GroupMembers.columns.groupchat_id == 125)
)
# Execute the delete statement and save the change
# (SQLAlchemy 2.0 connections do not autocommit)
conn.execute(query)
conn.commit()

That's it! You have learned how to use SQLAlchemy to create a simple database, define tables, perform CRUD operations, and manipulate data in the database.
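
As a recap, the sketch below runs the same create, update, delete and read steps end to end. It makes two assumptions that differ from the steps above: it uses an in-memory SQLite database (`sqlite://`) so that no file is left behind, and it wraps the writes in `engine.begin()`, which commits the transaction automatically when the block exits. The variable names `members` and `remaining` are chosen only for this illustration.

```python
import sqlalchemy as db

# In-memory SQLite database so the sketch leaves no file behind
engine = db.create_engine("sqlite://")
metadata = db.MetaData()

members = db.Table(
    "GroupMembers", metadata,
    db.Column("Id", db.Integer, primary_key=True),
    db.Column("user_id", db.Integer),
    db.Column("groupchat_id", db.Integer),
    db.Column("no_bullying", db.Integer),
)
metadata.create_all(engine)

# Create: engine.begin() opens a transaction and commits it on exit
with engine.begin() as conn:
    conn.execute(db.insert(members).values(user_id=4427, groupchat_id=125, no_bullying=1))
    conn.execute(db.insert(members).values(user_id=2, groupchat_id=12, no_bullying=1))

# Update one row and delete another in a second transaction
with engine.begin() as conn:
    conn.execute(
        members.update()
        .where(db.and_(members.c.user_id == 2, members.c.groupchat_id == 12))
        .values(no_bullying=members.c.no_bullying + 1)
    )
    conn.execute(members.delete().where(members.c.user_id == 4427))

# Read: a plain connection is enough because nothing is modified
with engine.connect() as conn:
    rows = conn.execute(members.select()).fetchall()

remaining = [(r.user_id, r.groupchat_id, r.no_bullying) for r in rows]
print(remaining)  # [(2, 12, 2)]
```

Here the update increments the counter inside the database (`members.c.no_bullying + 1`) rather than reading the old value into Python first; either approach works for this tutorial.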

Since you know how to perform basic operations already, it's time to create a database for the bot.

Building the database

  • Create a new Python file "database.py".

    This file will contain all functions necessary for the bot to communicate with the database.

  • Create the Group member's table.

import sqlalchemy as db

#create an engine
engine = db.create_engine('sqlite:///database.sqlite')
#create a connection 
conn = engine.connect()

#metadata
metadata = db.MetaData()

GroupMembers = db.Table('GroupMembers', metadata,
                        db.Column('Id', db.Integer(), primary_key=True),
                        db.Column('user_id', db.Integer),
                        db.Column('groupchat_id', db.Integer),
                        db.Column('no_bullying', db.Integer),
                        )

metadata.create_all(engine)
  • Rewrite the functions you wrote earlier so that they use SQLAlchemy: one to add a user_id and groupchat_id to the database (or increase the user's count if they are already there), one to check whether a user has hit the cyberbullying limit, and one to reset the number of bullying messages the user has sent to zero.
MAX_BULLYING_MESSAGES=3

def add_to_db(user_id,groupchat_id):
    """ record one cyberbullying message for the user
        in that group chat """
    #engine.begin() opens a transaction and commits it when the block ends,
    #so the changes are actually saved to the database file
    with engine.begin() as conn:
        #first check if the user_id and groupchat_id are in our database
        search_query = GroupMembers.select().where(db.and_(GroupMembers.columns.user_id == user_id, GroupMembers.columns.groupchat_id == groupchat_id))
        result=conn.execute(search_query).fetchone()

        if result:
            #increase no_bullying value by 1
            no_bullying = result.no_bullying
            update_query = GroupMembers.update().where( db.and_(GroupMembers.columns.user_id == result.user_id, GroupMembers.columns.groupchat_id == result.groupchat_id)).values(no_bullying=no_bullying + 1)
            conn.execute(update_query)
        else:
            #add the user
            insert_query = db.insert(GroupMembers).values(user_id=user_id, groupchat_id=groupchat_id,no_bullying=1)
            conn.execute(insert_query)

def has_hit_limit(user_id,groupchat_id):
    """ check if the user has hit the cyberbullying
        limit for that group chat """
    #search for user (read-only, so a plain connection is enough)
    search_query = GroupMembers.select().where(db.and_(GroupMembers.columns.user_id == user_id, GroupMembers.columns.groupchat_id == groupchat_id))
    with engine.connect() as conn:
        result=conn.execute(search_query).fetchone()
    if result:
        if result.no_bullying>=MAX_BULLYING_MESSAGES:
            return True
    return False

def reset_user_record(user_id,groupchat_id):
    """ set the user's number of bullying messages
        back to zero for that group chat """
    update_query = GroupMembers.update().where( db.and_(GroupMembers.columns.user_id == user_id, GroupMembers.columns.groupchat_id == groupchat_id)).values(no_bullying=0)
    with engine.begin() as conn:
        conn.execute(update_query)
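
To see how the three functions work together, the sketch below walks one user through the full life cycle: three flagged messages, hitting the limit, and a reset. So that it runs on its own, it contains condensed re-implementations of the functions against an in-memory database; the `_match` helper and the IDs `1001` and `-500` are hypothetical and exist only for this illustration.

```python
import sqlalchemy as db

MAX_BULLYING_MESSAGES = 3

engine = db.create_engine("sqlite://")  # in-memory database, for the demo only
metadata = db.MetaData()
GroupMembers = db.Table(
    "GroupMembers", metadata,
    db.Column("Id", db.Integer, primary_key=True),
    db.Column("user_id", db.Integer),
    db.Column("groupchat_id", db.Integer),
    db.Column("no_bullying", db.Integer),
)
metadata.create_all(engine)

def _match(user_id, groupchat_id):
    """Hypothetical helper: WHERE clause identifying one user in one group."""
    return db.and_(GroupMembers.c.user_id == user_id,
                   GroupMembers.c.groupchat_id == groupchat_id)

def add_to_db(user_id, groupchat_id):
    # engine.begin() commits automatically when the block exits
    with engine.begin() as conn:
        row = conn.execute(GroupMembers.select().where(_match(user_id, groupchat_id))).fetchone()
        if row is not None:
            conn.execute(GroupMembers.update().where(_match(user_id, groupchat_id))
                         .values(no_bullying=row.no_bullying + 1))
        else:
            conn.execute(db.insert(GroupMembers).values(
                user_id=user_id, groupchat_id=groupchat_id, no_bullying=1))

def has_hit_limit(user_id, groupchat_id):
    with engine.connect() as conn:
        row = conn.execute(GroupMembers.select().where(_match(user_id, groupchat_id))).fetchone()
    return row is not None and row.no_bullying >= MAX_BULLYING_MESSAGES

def reset_user_record(user_id, groupchat_id):
    with engine.begin() as conn:
        conn.execute(GroupMembers.update().where(_match(user_id, groupchat_id))
                     .values(no_bullying=0))

# Walk one made-up user in one made-up group through the life cycle
history = []
for _ in range(MAX_BULLYING_MESSAGES):
    add_to_db(user_id=1001, groupchat_id=-500)
    history.append(has_hit_limit(1001, -500))

reset_user_record(1001, -500)
history.append(has_hit_limit(1001, -500))
print(history)  # [False, False, True, False]
```

The limit is reached on the third flagged message, and the reset brings the user back below it. A user who has never been recorded simply has no row, so `has_hit_limit` returns False for them.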

Finally, remove the earlier list-based versions of these functions from the bot.py file and import the new ones from the database.py file instead.

The final bot.py file:

import pickle
import os
import logging
from datetime import datetime, timedelta
from telegram import Update
from telegram.ext import filters, MessageHandler,CommandHandler, ContextTypes,Application
from model import clean_text
from database import add_to_db, has_hit_limit, reset_user_record,MAX_BULLYING_MESSAGES


#number of days a user stays banned after hitting the limit
NO_BANNED_DAYS=3
TOKEN="YOUR BOT TOKEN"
model_path="assets/model.pickle" 
vectorizer_path="assets/vectorizer.pickle"

logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s", level=logging.INFO
)

def is_cyberbullying(text):
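    #load the saved vectorizer and model, closing each file after reading
    with open(vectorizer_path,'rb') as f:
        vectorizer = pickle.load(f)
    with open(model_path,'rb') as f:
        model = pickle.load(f)
    text=clean_text(text)
    prediction=model.predict(vectorizer.transform([text]))[0]
    #the model labels cyberbullying messages as 1
    return bool(prediction==1)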

#start command
async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Give a simple explanation of what the bot 
        does"""
    await context.bot.send_message(
        chat_id=update.effective_chat.id, text=f"This is a cyberbullying bot. I will help you remove users who engage in cyberbullying in your group chats.\n\nHow to use?\n\nAdd me as an admin to your group chat and give me permission to remove users and to send and delete messages.\nUsers are given {MAX_BULLYING_MESSAGES} opportunities: if a user sends {MAX_BULLYING_MESSAGES} messages recognized as cyberbullying, the user will be removed and banned for {NO_BANNED_DAYS} days!"
    )

#cyberbullying handler
async def remove_cyberbullying(update: Update, context: ContextTypes.DEFAULT_TYPE):
   chat_id=update.effective_chat.id
   message=update.effective_message #also covers edited messages, where update.message is None
   sender_id=message.from_user.id
   if is_cyberbullying(message.text):
       add_to_db(sender_id,chat_id)

       #if user has hit limit
       if has_hit_limit(sender_id,chat_id):
           #current date (timezone-aware, so the end of the ban is unambiguous)
           current_date=datetime.now().astimezone()
           ban_duration=timedelta(days=NO_BANNED_DAYS)
           unban_date=current_date+ban_duration

           #reset record
           reset_user_record(sender_id,chat_id)

           #remove user 
           await context.bot.send_message(chat_id=update.effective_chat.id, text=f"The message you've sent is abusive, and you've exceeded the cyberbullying limit. You'll be banned from the group chat for {NO_BANNED_DAYS} days!",reply_to_message_id=message.message_id)

           await context.bot.ban_chat_member(chat_id=chat_id,user_id=sender_id, revoke_messages=False, until_date=unban_date)

       else:
           #send a message that the person has sent an abusive message 
           await context.bot.send_message(chat_id=update.effective_chat.id, text="The message you've sent is abusive. Be careful or you'll be removed from the group chat soon!",reply_to_message_id=message.message_id)

if __name__ == "__main__":
    application = Application.builder().token(TOKEN).build()
    start_handler = CommandHandler("start", start, filters=~filters.ChatType.GROUPS)
    cyberbullying_handler = MessageHandler(filters.TEXT & filters.ChatType.GROUPS , remove_cyberbullying) #filters.TEXT & filters.ChatType.GROUPS tells us we want only text messages from group chats 
    application.add_handler(start_handler)
    application.add_handler(cyberbullying_handler)
    application.run_polling()
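
One thing to note about is_cyberbullying is that it reads both pickle files from disk for every incoming message. A possible refinement, sketched below, is to cache the loaded objects so each file is read only once. The `load_pickle` helper is a hypothetical name, not part of the project, and the demo pickles a small stand-in dictionary instead of the real model so that the snippet runs on its own.

```python
import os
import pickle
import tempfile
from functools import lru_cache

@lru_cache(maxsize=None)
def load_pickle(path):
    """Load a pickled object once per path; later calls reuse the cached object."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Demo with a stand-in object instead of the real model (assumption for illustration)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pickle")
    with open(demo_path, "wb") as f:
        pickle.dump({"name": "stand-in model"}, f)

    first = load_pickle(demo_path)   # reads the file
    second = load_pickle(demo_path)  # served from the cache

print(first is second)                 # True
print(load_pickle.cache_info().hits)   # 1
```

In bot.py, is_cyberbullying could then call `load_pickle(vectorizer_path)` and `load_pickle(model_path)` instead of opening the files itself. The trade-off is that a retrained model is only picked up after the bot restarts.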

The full source code for the project can be found in my GitHub repository.

Run the bot.py file again, add the bot to a group chat, and make sure you give it admin permissions. Then have test accounts send messages that the model flags as cyberbullying and confirm that the bot removes them from the group chat.

And that's all! You've successfully created a database for your bot to store data.

Conclusion

In this tutorial, you've learned how to create a Telegram bot, build a simple machine-learning text classifier, and build an SQL database with SQLAlchemy. Many other ideas and improvements could still make this bot better. Some of them include:

  • Making the bot delete messages it perceives as cyberbullying immediately from the group.

  • Sourcing more data to build a model with higher performance that can also do more (for example, recognize spam messages or 18+ rated messages).

  • Build a more advanced text classifier using pre-trained embeddings, pre-trained models or more complex neural networks.

  • Use more advanced SQL databases like MySQL or PostgreSQL instead of the simple SQLite database used here.

  • Deploy your bot to a platform like Heroku or Digital Ocean.
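
As a starting point for the first idea in the list, the sketch below shows one way a handler might delete a flagged message. It assumes python-telegram-bot v20 or later, where a message object exposes an awaitable `delete()` method, and that the bot has permission to delete messages in the group. The function name `delete_if_bullying`, the `FakeMessage` and `FakeUpdate` classes, and the `toy_classifier` are all hypothetical stand-ins so the snippet can run without a bot token.

```python
import asyncio

async def delete_if_bullying(update, context, is_cyberbullying):
    """Delete the incoming message if the classifier flags it; return True if deleted."""
    message = update.effective_message
    if message is not None and message.text and is_cyberbullying(message.text):
        await message.delete()
        return True
    return False

# --- Stand-ins for the Telegram objects, for demonstration only ---
class FakeMessage:
    def __init__(self, text):
        self.text = text
        self.deleted = False

    async def delete(self):
        self.deleted = True

class FakeUpdate:
    def __init__(self, text):
        self.effective_message = FakeMessage(text)

def toy_classifier(text):
    # Placeholder rule; the real bot would call is_cyberbullying from bot.py
    return "badword" in text.lower()

flagged = FakeUpdate("you are a BADWORD")
clean = FakeUpdate("good morning everyone")
results = [
    asyncio.run(delete_if_bullying(flagged, None, toy_classifier)),
    asyncio.run(delete_if_bullying(clean, None, toy_classifier)),
]
print(results)  # [True, False]
```

In the real bot, the same `await message.delete()` call could be placed inside remove_cyberbullying after the warning has been sent, since a reply cannot reference a message that no longer exists.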

Many more interesting projects can be built on the Telegram API, and you can learn more from the official API documentation.

Thanks for reading!