Implementing TextRank on my blog search engine

Published under the Blog Search Engine category.

James' Coffee Blog search engine page

The goal of my blog search engine is to make content on my website as easy to find as possible. Before the latest update, you could use my search engine to find a page based on a term that appeared in the URL, meta description, or title of a page. However, I kept thinking about how I could enhance the search engine to make content even easier to find. Since then, I have made a number of improvements, most related to either code efficiency or content discovery. In this article, I’ll share a bit about how I implemented TextRank on my site.

Implementing TextRank

Until now, I had not indexed the full text of any pages on my blog. I indexed only titles, meta descriptions, and URLs, because indexing full pages would take up a lot of storage space and eat into the efficiency of my search engine. There is no need for me to keep a record of every word on a page. However, indexing only titles, meta descriptions, and URLs meant that searching by keyword was impossible.

In last week’s Homebrew Website Club meeting, one person mentioned how Natural Language Processing or data analysis could be applied to search. Admittedly, at the time I thought this would all be way over my head. However, when I started to think about how to make content easier to find, this idea came back to me. I did a bit of research and found an algorithm called TextRank which lets you find relevant keywords and sentences in a body of text. After reading into it a bit more, I discovered TextRank would let me support searching for various keywords. It turns out there is even a Python implementation of TextRank, pytextrank, which made it easy to get started.

I set up TextRank by following the instructions on the pytextrank PyPI page. I also found this handy resource on how TextRank works, which explains the algorithm in detail. My use case is finding relevant keywords in a body of text. TextRank does this by building a graph of words that appear near each other and assigning weights to those connections. The PageRank algorithm is then used to work out how important each word is to the text, and TextRank uses those scores to decide which words are most relevant to the article.
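To make that description concrete, here is a stripped-down sketch of the keyword side of the algorithm. This is not pytextrank itself: the window size, damping factor, and whitespace tokenisation are all simplifying assumptions, and a real implementation also filters by part of speech and groups words into phrases.

```python
# Simplified TextRank keyword scoring: build a co-occurrence graph over
# nearby words, then score each word with PageRank-style power iteration.
# This is an illustrative sketch, not the pytextrank implementation.
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    # Weight an edge between two words by how often they appear
    # within `window` tokens of each other
    weights = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if word != words[j]:
                weights[word][words[j]] += 1
                weights[words[j]][word] += 1

    totals = {w: sum(ns.values()) for w, ns in weights.items()}

    # Power iteration: a word is important if important words co-occur with it
    scores = {w: 1.0 for w in weights}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] * wt / totals[n] for n, wt in ns.items()
            )
            for w, ns in weights.items()
        }

    # Most relevant words first, mirroring how I slice off the top results
    return sorted(scores, key=scores.get, reverse=True)

tokens = "coffee blog search engine coffee blog coffee search".split()
keywords = textrank_keywords(tokens)
```

The nice property of this graph-based approach is that it needs no training data: relevance falls out of the structure of the text itself.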

After I had set up TextRank, the first thing to do was to get content from a page to analyse. I first look for an article tag with an “h-entry” microformat class, which I have on all of my blog posts. If this does not exist—which would be the case if I was indexing a page that was not a blog post, of which I have quite a few—then my code searches for the div with the “main” ID. For a wider search engine, a lot more would have to go into content discovery. But because I use semantic HTML and I am quite picky about structure, it was quite easy to find the content on a page.

This code lets me find content to run through TextRank:


import requests
from bs4 import BeautifulSoup

page = requests.get(u.find("loc").text) # I make a request to the URL of the page I am indexing (u is defined earlier in my code)
page_desc_soup = BeautifulSoup(page.content, "lxml")

if page_desc_soup.find("article", {"class": "h-entry"}):
	page_text = page_desc_soup.find("article", {"class": "h-entry"})
else:
	page_text = page_desc_soup.find("div", {"id": "main"})

If there is content on the page that I can index, I use TextRank to find the most relevant keywords. I then take the 10 most relevant keywords and store them in my blog database, in a new “keywords” column. I store a maximum of 10 keywords because: (i) every keyword stored takes up more space and I don’t want to use more than I need to; (ii) the more keywords I add, the less likely it is that any one keyword is representative of the page (keywords toward the end of the TextRank results are less relevant than those at the start).
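As a sketch of what storing those keywords might look like, here is a minimal example using a hypothetical posts table. The table name, columns other than “keywords”, and the comma-separated encoding are all assumptions for illustration; my real schema differs.

```python
import sqlite3

# Hypothetical schema for illustration; the "keywords" column is the new one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (url TEXT PRIMARY KEY, title TEXT, keywords TEXT)")

# Store at most the top ten TextRank keywords as a comma-separated string
important_phrases = ["european coffee trip", "blogroll", "rss feeds"]
conn.execute(
    "INSERT INTO posts (url, title, keywords) VALUES (?, ?, ?)",
    ("https://example.com/blogroll/", "Blogs I follow",
     ", ".join(important_phrases[:10])),
)

row = conn.execute("SELECT keywords FROM posts").fetchone()
print(row[0])  # european coffee trip, blogroll, rss feeds
```

Keeping the keywords in a single text column keeps the schema simple and works well with full-text search, at the cost of not being able to query individual keywords relationally.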

This is the code I use to find 10 keywords that I then store in my index later:


if page_text:
	doc = nlp(page_text.text)
	important_phrases = [p.text for p in doc._.phrases if float(p.count) > 1][:10]
else:
	important_phrases = []

If there is main text on a page (not in the navigation bar, footer, etc.), I use the TextRank algorithm (“nlp(page_text.text)”) to find the most important phrases (keywords) on a page. I only let a word become a keyword if it appears more than once in an article.

If I cannot get any main text on a page, I do not add any important phrases to the index. Later, when I actually add a record into my index, these keywords are added so that I can search them using the search engine. On the front-end, the FTS5 full-text search searches through the new “keywords” column where I store keywords every time I make a search query. I discussed what FTS5 does briefly in the first post I wrote on this search engine.
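To illustrate the FTS5 side, here is a minimal, self-contained sketch. The table and column names are hypothetical rather than my actual index, but it shows the key behaviour: a term that appears only in the keywords column can still match a page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical FTS5 index with a keywords column alongside the title
conn.execute("CREATE VIRTUAL TABLE search_index USING fts5(url, title, keywords)")
conn.execute(
    "INSERT INTO search_index VALUES (?, ?, ?)",
    ("https://example.com/blogroll/", "Blogs I follow",
     "european coffee trip, blogroll, rss feeds"),
)

# The phrase appears nowhere in the title or URL, only in the keywords
results = conn.execute(
    "SELECT url FROM search_index WHERE search_index MATCH ?",
    ('keywords: "european coffee trip"',),
).fetchall()
print(results)
```

In practice you would match against all columns rather than filtering to one, so that title, URL, description, and keyword hits all surface in the same query.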

Wrapping up

Now if you search for “European Coffee Trip” you will see the “list of blogs I follow” article appear in the results, because that keyword is tagged to the article. Previously this was not possible, but now it is because I support keyword searching. Of course, this is not a perfect way of discovering content. There will be some keywords for which you cannot find a result even though they are mentioned in an article. However, by adding keywords as a potential search vector, this update massively improves your ability to discover content.

Also posted on IndieNews.
