The Python toolkit for converting Reddit threads into organized text data. Extract and process Reddit content with ease!

NFeruch/reddit2text

Reddit2Text

reddit2text is a Python library that transforms any Reddit thread into clean, readable text data.

Perfect for prompting an LLM, performing NLP/data analysis, or simply archiving for offline use, reddit2text offers a straightforward interface to access and convert content from Reddit.

Installation

Easy install using pip

pip3 install reddit2text

Quickstart

First, you need to create a Reddit app to get your client_id and client_secret, in order to access the Reddit API.

Here's a visual step-by-step guide I created to do this! Alternatively, you can look at Reddit's API documentation.

Then, replace the client_id, client_secret, and user_agent with your credentials.

The user agent can be anything you like, but we recommend following this convention according to Reddit's guidelines: '<app type>:<app name>:<version> (by <your username>)'

This is enough to get started:

from reddit2text import Reddit2Text

r2t = Reddit2Text(
    # replace with your actual creds
    client_id='123abc',
    client_secret='123abc',
    user_agent='script:my_app:v1.0 (by u/reddit2text)'
)

URL = 'https://www.reddit.com/r/AskReddit/comments/1by3p2o/whats_the_stupidest_animal_and_how_has_it/'

output = r2t.textualize_post(URL)
print(output)

Here is an example (truncated) output from the above code! https://pastebin.com/niQTGbys
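
If you would rather not hardcode credentials, the sketch below reads them from environment variables and builds the user agent in the recommended convention. The variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helpers `build_user_agent` and `load_credentials` are assumptions made for illustration; they are not part of reddit2text.

```python
import os


def build_user_agent(app_type, app_name, version, username):
    """Format a user agent as '<app type>:<app name>:<version> (by u/<username>)'."""
    return f"{app_type}:{app_name}:{version} (by u/{username})"


def load_credentials(env=os.environ):
    """Collect the three Reddit2Text keyword arguments from environment variables.

    The variable names are hypothetical; use whatever names suit your setup.
    Raises KeyError if either variable is missing.
    """
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": build_user_agent("script", "my_app", "v1.0", "reddit2text"),
    }


# Usage (requires reddit2text to be installed and both variables to be set):
# from reddit2text import Reddit2Text
# r2t = Reddit2Text(**load_credentials())
```

Keeping secrets out of the source file makes it harder to commit them to a repository by accident.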

Extra Configuration

  • max_comment_depth, Optional[int]:
    • Maximum depth of comments to output, counting the top-most comment as the first level. Defaults to None. Pass None or -1 to include all comments.
  • comment_delim, Optional[str]:
    • String/character used to indent comments according to their nesting level. Defaults to | to mimic Reddit.
r2t = Reddit2Text(
    # credentials ...
    max_comment_depth=3,  # limit each comment chain to 3 levels, counting the top-level comment
    comment_delim='#'  # each comment level will be preceded by multiples of this string
)
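
To make the two options concrete, here is a standalone sketch of how a depth limit and a repeated delimiter interact. It is not reddit2text's implementation: `render_comments`, the dictionary layout of `thread`, and the exact line format are all assumptions made for illustration.

```python
from typing import Optional


def render_comments(comments, delim="|", max_depth: Optional[int] = None, level=1):
    """Flatten nested comments into lines, prefixing each with `delim` once per nesting level."""
    lines = []
    for comment in comments:
        # None or -1 means "no limit"; otherwise stop once the level exceeds max_depth.
        if max_depth not in (None, -1) and level > max_depth:
            break
        lines.append(f"{delim * level} {comment['author']}: {comment['body']}")
        # Recurse into the replies one level deeper.
        lines.extend(render_comments(comment.get("replies", []), delim, max_depth, level + 1))
    return lines


# Hypothetical thread: two top-level comments, the first with a two-level reply chain.
thread = [
    {"author": "alice", "body": "Top-level comment", "replies": [
        {"author": "bob", "body": "First reply", "replies": [
            {"author": "carol", "body": "Nested reply", "replies": []},
        ]},
    ]},
    {"author": "dave", "body": "Another top-level comment", "replies": []},
]

print("\n".join(render_comments(thread, delim="#", max_depth=2)))
```

With `max_depth=2`, the third-level reply from `carol` is dropped, and each remaining line starts with one `#` per nesting level.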

Current Features

  • Convert any Reddit thread (the post + all its comments) into structured text.
  • Include all comments, with the ability to specify the maximum comment depth.
  • Configure a custom comment delimiter, for visual separation of nested comments.

Have a Feature Idea?

Simply open an issue on GitHub and tell me what should be added to the next release!

Planned Features

  • Comprehensive Formatting/Saving
    • Being able to save to a file location as .txt, .csv, .json, or to your clipboard!
  • Filtering/Sorting
    • Filter/sort comments based on upvotes, author name, body content, number of replies, etc. Also add in the ability to get the Top N comments.
  • Extra data fields
    • Access extra information for each post/comment, like whether it's NSFW or not and when it was created
  • Image/video support
    • Enable mining of not just text threads, but also image and video posts
  • CLI output
    • Add a progress bar to the terminal for threads with a large number of comments
  • Anonymize usernames
    • Give the ability to obfuscate usernames, while still preserving their uniqueness across all comments
  • Iterate across many posts at once
    • Given a subreddit as the input and the sorting method (hot, top, new, etc.), loop over multiple posts at once and textualize them
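
Until saving and multi-post support are built in, something similar can be approximated with the current API. The sketch below is a workaround, not a preview of the planned features: `save_threads` and the `thread_001.txt` naming scheme are hypothetical, and it assumes `textualize_post` returns a string, as the Quickstart's `print(output)` suggests.

```python
from pathlib import Path


def save_threads(r2t, urls, out_dir):
    """Textualize each URL with `r2t` and write the result to a numbered .txt file.

    `r2t` is any object exposing textualize_post(url) -> str.
    Returns the list of file paths that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        text = r2t.textualize_post(url)
        target = out_path / f"thread_{index:03d}.txt"
        target.write_text(text, encoding="utf-8")
        saved.append(target)
    return saved


# Usage (with an r2t instance created as in the Quickstart):
# save_threads(r2t, [URL], "threads")
```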

Contributions

Contributions to reddit2text are always welcome! I'm just a person who made something I think is useful, so any help is appreciated. You can always submit a pull request or open an issue on the GitHub repository.

License

reddit2text is released under the Apache License 2.0. See the LICENSE file for more details.