
4chan Text Generator using Markov Chains
August 2nd 2018

The Twitter profile picture of Tay


We all remember Tay, the chatbot Microsoft released on Twitter that
quickly turned into a racist, misogynistic, deviant neo-Nazi. This is
what happens when ill-intentioned individuals take advantage of an
“innocent” machine learning algorithm. Before unleashing Tay into the
open world, Microsoft trained its algorithm to have the personality of a
fun and loving 16-year-old. I recall finding it quite fascinating (and
funny) to see how quickly the internet could turn a teen-talking chatbot
into an AI nightmare.
Which made me think: what if Microsoft had kept Tay online? What if
we did it intentionally, training the algorithm to be the worst it can
be?
Hence this project, where I intend to train a Markov chain on datasets
generated using the 4chan APIs:

02sh/4chanMarkovText
4chanMarkovText - Text Generation using Markov Chains feeded by 4chan
APIsgithub.com

APIs, Dataset Generation

4chan is a very special place; it represents a social entity unlike any
other. With the cover of anonymity and the absence of moderation, you
can say almost anything you want without fear of retribution, resulting
in blatant racism, homophobia, gore, etc.

Given our goal of creating the worst AI chatbot, 4chan is the perfect
candidate to generate our datasets from.

Like Reddit, 4chan is divided into various boards, each with its own
specific content and community of users. As a result, each dataset will
be generated using only one board as input:

In 2012, the 4chan team released a set of APIs to facilitate the work of
developers scraping the site. As shown above, I use them to retrieve
all the posts from the first five pages of the corresponding board
(passed as a parameter). After cleaning the data (parser function), I
finish by writing the results into a file (e.g. ./data/fit.txt).
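The fetching code itself appears above as a screenshot, so here is a minimal sketch of the same idea in Go. The endpoint format (https://a.4cdn.org/{board}/{page}.json) and the com field come from the public 4chan JSON API; the Post/Thread/Page types, the fetchBoard helper, and this particular parser implementation are my own illustrative reconstruction, not the repo's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"html"
	"net/http"
	"os"
	"regexp"
	"strings"
)

// Minimal slices of the 4chan page JSON: each page holds threads, each
// thread holds posts, and a post's "com" field is its HTML comment body.
type Post struct {
	Com string `json:"com"`
}
type Thread struct {
	Posts []Post `json:"posts"`
}
type Page struct {
	Threads []Thread `json:"threads"`
}

var tagRe = regexp.MustCompile(`<[^>]*>`)

// parser cleans a raw comment: <br> becomes a space, remaining HTML
// tags are stripped, and entities like &amp; are decoded.
func parser(raw string) string {
	text := strings.ReplaceAll(raw, "<br>", " ")
	text = tagRe.ReplaceAllString(text, "")
	return strings.TrimSpace(html.UnescapeString(text))
}

// fetchBoard downloads the first n index pages of a board and writes
// the cleaned posts, one per line, to a text file.
func fetchBoard(board string, n int, out string) error {
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()
	for p := 1; p <= n; p++ {
		resp, err := http.Get(fmt.Sprintf("https://a.4cdn.org/%s/%d.json", board, p))
		if err != nil {
			return err
		}
		var page Page
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return err
		}
		for _, t := range page.Threads {
			for _, post := range t.Posts {
				if post.Com != "" {
					fmt.Fprintln(f, parser(post.Com))
				}
			}
		}
	}
	return nil
}

func main() {
	// Offline demo of the cleaning step only;
	// fetchBoard("fit", 5, "./data/fit.txt") would hit the live API.
	fmt.Println(parser(`Hello<br>world &amp; <span class="quote">&gt;greentext</span>`))
}
```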
Markov Chain Text Generator

For a given system or process, a Markov chain is essentially a sequence
of states with probabilities associated with the transitions between
states.

Example of a Markov chain representing the link between activities and the
weather conditions
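To make this concrete, here is a tiny sketch in Go (the states and probabilities are made up for illustration, not taken from the figure): each state maps to its possible successors and the probability of each transition, and the probability of a path is the product of its transition probabilities.

```go
package main

import "fmt"

// transitions maps each state to its possible next states and the
// probability of each transition (illustrative numbers only).
var transitions = map[string]map[string]float64{
	"Sunny": {"Sunny": 0.6, "Rainy": 0.4},
	"Rainy": {"Sunny": 0.3, "Rainy": 0.7},
}

func main() {
	// Probability of observing the path Sunny -> Rainy -> Rainy is the
	// product of the two transition probabilities (0.4 * 0.7).
	p := transitions["Sunny"]["Rainy"] * transitions["Rainy"]["Rainy"]
	fmt.Printf("P(Sunny->Rainy->Rainy) = %.2f\n", p)
}
```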

In the context of text generation, a Markov chain helps you determine
the most probable suffix word for a given prefix. To produce good
results, it is important to provide the algorithm with relatively large
training sets. By fetching all the posts from the first five pages of a
given board, we get around 50,000 words per dataset.
As a parameter to our Markov chain constructor (NewMarkovFromFile),
we pass 2 as the default value for the prefix length (n), which is the
recommended value to keep the output text from seeming too random, or
too close to the original text.
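The constructor itself isn't reproduced in the article, so here is a self-contained sketch of a prefix-length-n Markov text generator in Go (the types and method names are mine, not the repo's): training records every prefix → suffix transition seen in the text, and generation walks the chain by repeatedly picking a random recorded suffix.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// Markov holds suffix lists keyed by n-word prefixes.
type Markov struct {
	n     int
	chain map[string][]string
}

// NewMarkov builds an empty chain with prefix length n (the article
// recommends n = 2).
func NewMarkov(n int) *Markov {
	return &Markov{n: n, chain: make(map[string][]string)}
}

// Train records every prefix -> suffix transition in the text.
func (m *Markov) Train(text string) {
	prefix := make([]string, m.n) // starts as n empty words
	for _, w := range strings.Fields(text) {
		key := strings.Join(prefix, " ")
		m.chain[key] = append(m.chain[key], w)
		copy(prefix, prefix[1:]) // shift the prefix window forward
		prefix[m.n-1] = w
	}
}

// Generate emits up to max words by randomly walking the chain,
// stopping early if the current prefix has no recorded suffix.
func (m *Markov) Generate(max int) string {
	prefix := make([]string, m.n)
	var out []string
	for i := 0; i < max; i++ {
		suffixes := m.chain[strings.Join(prefix, " ")]
		if len(suffixes) == 0 {
			break
		}
		next := suffixes[rand.Intn(len(suffixes))]
		out = append(out, next)
		copy(prefix, prefix[1:])
		prefix[m.n-1] = next
	}
	return strings.Join(out, " ")
}

func main() {
	m := NewMarkov(2)
	m.Train("the quick brown fox jumps over the lazy dog")
	// With this tiny corpus every prefix has exactly one suffix, so the
	// walk just reproduces the input; real datasets branch.
	fmt.Println(m.Generate(20))
}
```

On a ~50,000-word board dataset, many prefixes have several recorded suffixes, which is where the output starts to diverge from the source text.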

Results (NSFW)
Considering the nature of the content, this is not safe for work:

4chan Text Generator - Pastebin.com



What’s next?

This is just the first part of the project; a lot more can be done:

- Add the Twitter APIs to build an actual 4chan chatbot
- Do some language statistics on the datasets to see what topics are
trending, writing style…
Thanks for reading! If you liked this post, show some love by clicking on
the ‘Clap’ icon. You can also find me on Twitter and Github.
