By Abhijit Roy

An Approach To Build An Online News Distribution System

News has always been a significant part of our society. In the past, we mostly depended on news channels and newspapers to stay updated. In today's fast-paced world, media houses and agencies have moved to the internet to reach readers, and the shift has helped them extend their reach considerably.

With so many media outlets today, it is impossible for a busy reader to gather news from all of them. Moreover, each outlet covers a story differently, and some readers like to compare coverage of the same event across multiple houses to get the full picture. These needs are addressed by a type of application that is currently gaining popularity: online news distribution applications. They gather news from multiple sources and present it to the user as a single feed. In this article, we will look at an approach to building such an application.


The Idea


The main component of such an application is, of course, the news itself. I have used four of the most popular media houses in India as the sources. Each house has its own website, from which we scrape the headline links and the stories. We will use extractive text summarization to condense each story into a gist of 3 to 5 sentences. We will store the collected information, along with the source (the publishing media house), date, time, and title of the story, in date-wise files. Each date-wise file serves as the feed for that particular date.

We can extract another piece of information from the story title: the subject of the story. Each title contains something relevant, be it the name of a person, a country, an organization, or an important topic of the time, for instance, COVID-19. These names or topics are usually the subjects of the story. We will extract these words of interest from the title and use them as labels or tags for the corresponding stories, storing them alongside the titles in the files.

An app is used by many users with different tastes, so we need a filtering or recommender mechanism to customize each user's feed according to their interests. For this, we need a login system so we can record the kind of stories each user reads and recommend stories based on their own account. We will maintain a database containing the user's name, email, phone number (optional), and password, with the email serving as the unique key.

We will also maintain two JSON files. The first records the stories each user reads and the corresponding labels, keyed by the user's email; the labels tell us which topics the user is interested in. The second records the users who read each story. In this file, we form a unique key in the format:

Publishing House + $ + Publishing Date + $ + Story Title

This unique key is used as the key in the second JSON file, and each key maps to the emails of the users who read that story. The idea is that the labels attached to each email in the user file allow us to do content-based recommendations, and if we use both files together, we can build a full user-item interaction matrix, which can be used for collaborative-filtering-based recommendations.
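
To make the bookkeeping concrete, here is a minimal sketch of how the two records could be updated whenever a user opens a story, following the key format described above. The helper names story_key and log_read are my own illustration; the article's actual code, shown later, performs these updates inline inside the feed and search functions.

import json

def story_key(source, pub_date, title):
    # Unique story key: Publishing House + $ + Publishing Date + $ + Story Title
    return source + "$" + pub_date + "$" + title

def log_read(email, source, pub_date, title, labels):
    key = story_key(source, pub_date, title)

    # File 1: stories and labels visited by each user, keyed by email.
    with open("user_records.json") as f:
        users = json.load(f)
    entry = users.setdefault(email, {"news": [], "labels": []})
    entry["news"].append(key)
    entry["labels"].extend(labels)
    with open("user_records.json", "w") as f:
        json.dump(users, f)

    # File 2: users who read each story, keyed by the story key.
    with open("story_records.json") as f:
        stories = json.load(f)
    stories.setdefault(key, []).append(email)
    with open("story_records.json", "w") as f:
        json.dump(stories, f)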

Now, we can offer the user three types of distributions of news:

  1. Latest Feed: The fresh feed for every day

  2. Most Popular stories

  3. Customized Feed: may contain unvisited stories from the last 2–3 days, tuned according to the user's interests.

One thing worth noting is that the Latest Feed is neither personalized nor popularity-ranked; still, it is essential to make sure every story reaches the user and to keep a bit of randomness, otherwise the whole system would be too biased. The Latest Feed is simply the current date's feed. For popularity, we use the JSON file that records, for each story, the emails of all users who visited it: the popularity of a story is simply the length of its list of emails.
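
As a rough sketch of that popularity ranking (most_popular is an illustrative name of my own, not a function from the article's code, and it assumes the story_records.json structure described later):

import json

def most_popular(top_n=5):
    # Each key in story_records.json maps to the list of emails that opened the story.
    with open("story_records.json") as f:
        stories = json.load(f)
    # Popularity = number of recorded reader emails; sort in non-increasing order.
    ranked = sorted(stories.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(key, len(emails)) for key, emails in ranked[:top_n]]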

The next thing we must do is add a search option. As readers, we often want to read about a particular topic, and this option lets our users do exactly that.

Lastly, we need a “similar stories” option. When we buy a product on an e-commerce site, it shows us similar products to ease browsing; we will use the same idea here. When a user selects a story, we will show them similar stories to improve their experience.


Now that we have seen the whole idea, let's jump into the application itself.


Application

Let's first see how the news websites look and how we can easily scrape the required data.



The above image shows the story headlines (in red) and the corresponding links in the HTML source (in green). We need to extract the story links from the page source, then follow each link and extract the story itself.



from bs4 import BeautifulSoup
import requests

def News_18_scraper():
    URL = "https://www.news18.com/"
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    heads = {}

    # Lead story block
    sub = soup.find('div', attrs={'class': 'lead-story'})
    rows = sub.findAll('p')
    for row in rows:
        head = row.text
        heads[head] = {}
        heads[head]['Source'] = 'News18'
        heads[head]['link'] = row.a["href"]

    # Main story list
    sub = soup.find('ul', attrs={'class': 'lead-mstory'})
    rows = sub.findAll('li')
    for row in rows:
        head = row.text
        heads[head] = {}
        heads[head]["Source"] = 'News18'
        heads[head]["link"] = row.a["href"]

    return heads

The above piece of code is used to extract the links of the news stories for this particular media house.



The above image shows how a story webpage looks: the title of the story in green, the story text in red, and the corresponding portion of the source code in blue. We need to scrape all of this required data.


from datetime import datetime

def extractor_n18(news):
    for n in news.keys():
        link = news[n]['link']
        r = requests.get(link)
        soup = BeautifulSoup(r.content, 'html5lib')

        sub = soup.find("title")
        news[n]['Titles'] = [sub.text]
        tit = sub.text

        # First layout: lead paragraph plus article body
        flag = 0
        try:
            flag = 1
            text = ""
            sub = soup.find('div', {'class': 'lbcontent paragraph'})
            text += sub.text + "\n"
            sub_2 = soup.find('div', {'id': 'article_body'})
            text += sub_2.text
            summary = summarizer(text)
        except:
            flag = 0

        # Fallback layout: article box with <p> tags
        if flag == 0:
            text = ""
            try:
                sub = soup.find('article', {'class': 'article-content-box first_big_character'})
                rows = sub.findAll('p')
                for row in rows:
                    text += row.text + "\n"
                summary = summarizer(text)
            except:
                summary = tit  # fall back to the title if scraping fails

        news[n]['gists'] = summary
        news[n]['Date'] = datetime.today().strftime('%Y-%m-%d')
        news[n]['Time'] = str(datetime.now().time())

    return news

The above code can be used to extract the stories for the news agency.

I have created my own extractive text summarizer using the PageRank algorithm.



import numpy as np

def pagerank(text, eps=0.000001, d=0.85):
    # 'text' is a square sentence-similarity matrix; start with uniform scores.
    score_mat = np.ones(len(text)) / len(text)
    delta = 1
    while delta > eps:
        score_mat_new = np.ones(len(text)) * (1 - d) / len(text) + d * text.T.dot(score_mat)
        delta = abs(score_mat_new - score_mat).sum()
        score_mat = score_mat_new
    return score_mat_new

The above code shows the PageRank iteration. I will provide the link to the full code at the end.
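
The full summarizer lives in the linked repository; as a minimal sketch, the PageRank scores can be turned into a 3-to-5-sentence summary roughly as follows. This sketch reuses the get_similarity helper shown later in the article, and the wrapper itself is my own illustration rather than the exact repository code.

import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

def summarizer(article, top_n=4):
    # Split the story into sentences; very short stories are returned as-is.
    sentences = sent_tokenize(article)
    if len(sentences) <= top_n:
        return " ".join(sentences)

    stop_words = stopwords.words('english')
    tokenized = [word_tokenize(s) for s in sentences]

    # Pairwise sentence-similarity matrix fed to pagerank().
    sim_mat = np.zeros((len(sentences), len(sentences)))
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                sim = get_similarity(tokenized[i], tokenized[j], stop_words)
                sim_mat[i][j] = 0.0 if np.isnan(sim) else sim

    scores = pagerank(sim_mat)
    # Keep the highest-scoring sentences, restoring their original order.
    best = sorted(np.argsort(scores)[-top_n:])
    return " ".join(sentences[i] for i in best)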

Now, we have four such news sources. We scrape each of them individually and then compile everything into one database.

import pandas as pd

def Merge(dict1, dict2, dict3, dict4):
    res = {**dict1, **dict2, **dict3, **dict4}
    return res

def file_creater(date):
    # Scrape and extract stories from each of the four sources.
    news_times = times_now_scraper()
    times_now = extract_news_times(news_times)
    news_rep = republic_tv_scraper()
    republic_tv = extract_news_rep(news_rep)
    news_it = india_today_scraper()
    india_today = extractor_it(news_it)
    n_18 = News_18_scraper()
    News_18 = extractor_n18(n_18)

    Merged = Merge(times_now, republic_tv, india_today, News_18)

    # Build a dataframe with one row per story and save it as the date-wise feed file.
    Merged_df = pd.DataFrame(Merged)
    Merged_df_final = Merged_df.transpose()
    df_final = Merged_df_final.reset_index()
    df_final_2 = df_final.drop(['index'], axis=1)
    df_final_2.to_csv('feeds/Feed_' + date + '.csv', index=False)
    get_names('feeds/Feed_' + date + '.csv')

    return df_final_2

The above code gathers all the news together and writes a CSV feed file for the date passed in.

The get_names() function extracts the names or topics from the story titles using the Named Entity Recognition feature of the spaCy library, as sketched below.
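
get_names() itself is not listed in the article; a minimal sketch of how it could work, assuming spaCy's en_core_web_sm model and the Feed_<date>.csv layout produced above, might look like this:

import ast
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model choice

def get_names(feed_csv):
    df = pd.read_csv(feed_csv)
    labels = []
    for raw in df['Titles']:
        try:
            title = ast.literal_eval(raw)[0]   # Titles are stored as a stringified list
        except (ValueError, SyntaxError):
            title = str(raw)
        # Keep the named entities (people, places, organizations, topics) as story labels.
        doc = nlp(title)
        labels.append([ent.text for ent in doc.ents])
    df['labels'] = labels
    df.to_csv(feed_csv, index=False)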

After the full processing, we obtain a CSV containing the feed file for each date.





The above images describe how our news file databases look.

Next, we move to the user-control part, which starts with the login and signup pages.

import pandas as pd

def signup():
    Name = input("Name:")
    Email = input("Email:")
    Phone = input("Phone:")
    Password = input("Password:")
    Con_password = input("Confirm Password:")
    if Con_password != Password:
        print("Passwords don't match. Please retry")
        return signup()
    df = pd.read_csv('user_data.csv')
    df_2 = df[df['email'] == Email]
    if len(df_2) != 0:
        print("Email already exists, try a different email")
        return signup()

    # Append the new user to the users database.
    wr = open('user_data.csv', 'a')
    wr.write(Name + "," + Email + "," + Password + "," + Phone + "\n")
    wr.close()
    print("Now please log in")

def login():
    print("1 to Signup, 2 to Login")
    ch = int(input())
    if ch == 1:
        signup()
    df = pd.read_csv('user_data.csv')
    Email = input("Email:")
    Pass = input("Pass:")

    df_2 = df[df['email'] == Email]
    if len(df_2) == 0:
        print("Email not found, try again")
        return login()

    if str(df_2.iloc[0]['password']) == Pass:
        print("Welcome " + df_2.iloc[0]['Name'])
        surf(Email)
    else:
        print("Password wrong, try again")
        login()

The above snippet handles login and signup.




The above image demonstrates the signup flow. It includes checks such as asking the user to sign up with a different email if the email already exists.


The above image shows the structure of the users’ database. Now, let’s take a look at the two JSON file structures.




The first file, user_records.json, is shown in the above image. As discussed, it records the stories visited by the user with email XYZ@gmail.com along with the corresponding labels.



The image shows our second file, story_records.json. As seen earlier, it maps each story key to the emails of the users who visited the story. The length of each visitor list gives us the popularity of the story.
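
Based on how the code later in the article reads and writes these files, their shapes are roughly as follows (the emails, titles, and labels here are only illustrative placeholders):

# user_records.json: keyed by email; stores the visited story keys and their labels
user_records = {
    "XYZ@gmail.com": {
        "news": ["News182020-08-22Some story title"],
        "labels": ["COVID-19", "India"]
    }
}

# story_records.json: keyed by the story key; stores the emails of users who read it
story_records = {
    "News182020-08-22Some story title": ["XYZ@gmail.com", "ABC@gmail.com"]
}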

Now, let's return to the working of the application.




The above image shows the working of the application. As soon as we log in, it creates a session with the email ID and keeps logging our actions against that email. It shows us the latest feed first and then offers the following options:

  1. Searching

  2. Reading from the feed provided

  3. Popular stories

  4. Customized stories

If we want to read from the feed, it asks us to enter the index of a story. It then launches the chosen story and also gives us a list of similar stories to choose from.

For similar stories, I simply rank the story titles by their cosine similarity to the chosen story's title after a bit of preprocessing. One thing to keep in mind is that we only use the feeds of the last 3 consecutive days, i.e., if the user opens the app on the 4th, our feed will contain data from the 2nd to the 4th. This prevents the application from showing very old stories and also reduces computation; a sketch of how the 3-day window can be assembled is shown below.
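
Here, load_recent_feeds is an illustrative helper of my own, assuming the feeds/Feed_<date>.csv naming used by file_creater above:

from datetime import datetime, timedelta
import pandas as pd

def load_recent_feeds(days=3):
    # Concatenate the feed files of the last `days` dates, today included.
    frames = []
    for delta in range(days):
        date = (datetime.today() - timedelta(days=delta)).strftime('%Y-%m-%d')
        try:
            frames.append(pd.read_csv('feeds/Feed_' + date + '.csv'))
        except FileNotFoundError:
            pass   # skip days for which no feed file was generated
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()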

import re
from nltk.tokenize import sent_tokenize, word_tokenize

def clean_sentence(sentence):
    # Remove special characters, then tokenize the sentence into words.
    cleaned = re.sub("[^a-zA-Z0-9]", " ", sentence)
    return word_tokenize(cleaned)

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.cluster.util import cosine_distance

def get_similarity(sent_1, sent_2, stop_words):
    sent_1 = [w.lower() for w in sent_1]
    sent_2 = [w.lower() for w in sent_2]

    total = list(set(sent_1 + sent_2))  ## Removing duplicate words in total set

    ## Count Vectorization of two sentences
    vec_1 = [0] * len(total)
    vec_2 = [0] * len(total)
    for w in sent_1:
        if w not in stop_words:
            vec_1[total.index(w)] += 1
    for w in sent_2:
        if w not in stop_words:
            vec_2[total.index(w)] += 1

    return 1 - cosine_distance(vec_1, vec_2)

The above code preprocesses the data and computes the cosine similarity: it removes special characters, converts everything to lowercase, and ignores stop words.
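
As a quick illustration (with made-up strings, not actual headlines from the feed), the two helpers can be used together like this:

from nltk.corpus import stopwords

stop_words = stopwords.words('english')
query = clean_sentence("COVID-19 vaccine update")
title = clean_sentence("India reports progress on COVID-19 vaccine trials")
print(get_similarity(query, title, stop_words))   # closer to 1.0 means more similar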

Next, we move to the search feature. The user enters a topic; we compute the cosine similarity between the query and each story title individually and sort the stories in non-increasing order of similarity to get the search results. We could have used the labels instead, but since they are extracted automatically rather than manually, relying on them might degrade the results.

import ast
import json
import time
import webbrowser
from nltk.corpus import stopwords

def search(email, df):
    clear()
    search = input("search")
    df_temp = df
    sim = []
    # Score every story title against the query.
    for i in range(len(df)):
        try:
            title = ast.literal_eval(df.iloc[i]['Titles'])[0]
            cleaned_1 = clean_sentence(search)
            cleaned_2 = clean_sentence(title)
            stop_words = stopwords.words('english')
            s = get_similarity(cleaned_1, cleaned_2, stop_words)
            if s < 0.95:
                sim.append(s)
            else:
                sim.append(0)   # near-identical titles are ignored
        except:
            sim.append(0)

    df_temp['Sim'] = sim
    df_temp.sort_values(by=['Sim'], inplace=True, ascending=False)

    print("\n\n Top 5 Results \n")
    for i in range(5):
        res = ast.literal_eval(df_temp.iloc[i]['Titles'])
        print(str(i + 1) + "-> " + res[0])
        print(df_temp.iloc[i]['Source'] + " , " + df_temp.iloc[i]['Date'])
        print('\n\n')

    ind = int(input("Please provide the index of the story"))
    webbrowser.open(df_temp.iloc[ind - 1]['link'])
    time.sleep(3)

    try:
        # Log the visit against the user's email in user_records.json.
        file_u = open('user_records.json')
        users = json.load(file_u)
        key = df_temp.iloc[ind - 1]['Source'] + df_temp.iloc[ind - 1]['Date'] + \
              ast.literal_eval(df_temp.iloc[ind - 1]['Titles'])[0]
        lab = [z for z in ast.literal_eval(df_temp.iloc[ind - 1]['labels'])]
        if email not in users.keys():
            users[email] = {}
            users[email]['news'] = [key]
            users[email]['labels'] = lab
        else:
            users[email]['news'].append(key)
            for l in lab:
                users[email]['labels'].append(l)

        with open("user_records.json", "w") as outfile:
            json.dump(users, outfile)

        # Log the user's email against the story key in story_records.json.
        file_s = open('story_records.json')
        stories = json.load(file_s)
        if key not in stories.keys():
            stories[key] = [email]
        else:
            stories[key].append(email)
        with open("story_records.json", "w") as outfile:
            json.dump(stories, outfile)
    except:
        pass   # if the record files are missing or unreadable, skip logging

The above code handles searching. The function takes the user's email and the news dataframe to search through.




The above image shows the search feature. If we search for COVID-19, the application returns the top 5 matches along with the media house and the publication date.


Content-Based Filtering


We are not going to use content-based filtering in its full form; we will just borrow the idea behind it. We pick up the labels of the stories the user has visited from the JSON files, considering only the last 20 labels, because using more would stop the recommendations from shifting as the user's interests shift. Next, we compare the overlap between the user's labels and the labels of each story individually and recommend the 10 stories with the highest overlap. One thing to note is that we will not show stories the user has already seen; we explicitly set the overlap to 0 for them, using the information in the story_records.json file.

To compare the story labels, we use cosine similarity again; as before, count vectorization is used to obtain the vectors. A sketch of this customized-feed logic is given below.
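
The customized-feed code is not listed in the article; here is a rough sketch of the idea, using simple label-set overlap instead of the count-vector cosine described above. customized_feed and its parameters are illustrative names of my own.

import ast
import json

def customized_feed(email, df, top_n=10, history=20):
    with open('user_records.json') as f:
        users = json.load(f)
    with open('story_records.json') as f:
        stories = json.load(f)

    # Only the last few labels, so recommendations follow the user's current interests.
    user_labels = set(l.lower() for l in users.get(email, {}).get('labels', [])[-history:])

    scores = []
    for i in range(len(df)):
        key = df.iloc[i]['Source'] + df.iloc[i]['Date'] + ast.literal_eval(df.iloc[i]['Titles'])[0]
        if email in stories.get(key, []):
            scores.append(0)   # already read: force the overlap to 0
            continue
        story_labels = set(l.lower() for l in ast.literal_eval(df.iloc[i]['labels']))
        scores.append(len(user_labels & story_labels))

    # Recommend the stories with the largest overlap.
    return df.assign(Overlap=scores).sort_values(by='Overlap', ascending=False).head(top_n)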





The image shows the customized story feature in action.

This more or less sums up the description of the application.


Results


The above video provides a short demo of the application.





Conclusion


We have seen how we can develop an online news distribution application.

The GitHub link is here.


Hope this helps.

