Scraping 1000’s of News Articles using 10 simple steps

Web scraping with Python is simple if you follow along with these 10 steps.

Photo by michael podger on Unsplash


Web Scraping Series: Using Python and Software

Part-1: Scraping web pages without using software: Python

Part-2: Scraping web pages using software: Octoparse


Table of Contents

1. Introduction
1.1 Why this article?
1.2 Who should read this article?
2. Overview
2.1 A brief introduction to webpage design and HTML
2.2 Web scraping using BeautifulSoup in Python
3. Suggestion & conclusion
3.1 Full code

INTRODUCTION

WHY THIS ARTICLE?

The aim of this article is to scrape news articles from different websites using Python. Generally, web scraping involves accessing numerous websites and collecting data from them. However, we can also limit ourselves to collecting a large amount of information from a single source and using it as a dataset.

Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table (spreadsheet) format.

I was motivated to try web scraping while working on my machine-learning project, a Fake News Detection System. Whenever we begin a machine learning project, the first thing we need is a dataset. While there are many datasets online with varied information, sometimes you want to extract data on your own and begin your own investigation. I needed a dataset that I couldn’t find anywhere in the form I required.

So this motivated me to build my own dataset for the project, and that’s how I did my project from scratch. The project was about classifying news articles into two main categories: FAKE and REAL.

FAKE-NEWS DATASET

For this project, the first task was to get a dataset already labeled “FAKE”. This can be achieved by scraping data from verified and certified fact-checking websites that we can rely on for the facts behind news articles. Getting genuine “FAKE NEWS” is a really difficult task.



I went through these websites to build my FAKE-NEWS dataset:

Boom Live

Snopes

Politifact

AllSides


But honestly speaking, I ended up scraping data from only one website: Politifact.

There is a strong reason for this. As you go through the sites listed above, you will see that we need a dataset that is already labeled “FAKE”, but we also don’t want the news articles to be in a modified form. We want to extract the raw news article, without any keywords indicating whether that article is “FAKE” or not.

For example, if you go through “BoomLive.in”, you will find that the articles marked “FAKE” are not in their original form; they have been altered based on the fact-checking team’s analysis. Training an ML model on this altered text will give a biased result: the model will end up as a dumb one that can only flag articles containing keywords like “FAKE”, “DID?”, or “IS?” and will not perform well on new test data.

That’s why we use Politifact to scrape our “FAKE-NEWS DATASET”.

REAL-NEWS DATASET

The second task was to create a “REAL-NEWS” dataset. That is easy if you scrape news articles from trusted or verified news websites like “TOI”, “IndiaToday”, “TheHindu”, and many others. We trust these websites to list factual data, and even where they do not, we assume they do and train our model accordingly.

But for my project, I scraped both real and fake news from one website only (Politifact.com), since I got what I needed from it. It is also advisable, when scraping data using Python, to work with one website at a time. You can still scrape multiple pages of that website in one module by running an outer for loop.

WHO SHOULD READ THIS ARTICLE?

If you are working on a project where you need to scrape data by the thousands, this article is definitely for you. It doesn’t matter whether you come from a programming background or not, because people from many different backgrounds often need data for a project, a survey, or some other purpose.

Non-programmers may find it difficult to understand a programming language, so I will make scraping easy for them too by introducing software with which they can scrape large amounts of data easily.

Scraping using Python is not that difficult if you follow along with me while reading this blog. The only thing you need to focus on is the HTML source code of a webpage. Once you understand how webpages are written in HTML and can identify the elements and attributes of interest, you can scrape most websites.

For non-programmers who want to do web scraping using Python: focus mainly on the HTML code. Python syntax is not that difficult to understand; it’s just some libraries, functions, and keywords that you need to remember and understand. I have tried to explain every step transparently, and I hope that by the end of this series you will be able to scrape webpages with different types of layout.

OVERVIEW

This post covers the first part:

News articles web scraping using PYTHON.

We’ll create a script that scrapes the latest news articles from different newspapers and stores the text, which will be fed into the model afterward to get a prediction of its category.

A brief introduction to webpage design and HTML:

If we want to be able to extract news articles (or, in fact, any other kind of text) from a website, the first step is to know how a website works.


We will follow an example to understand this:

When we enter a URL into a web browser (e.g., Google Chrome or Firefox) and open it, what we see is the combination of three technologies:

HTML (HyperText Markup Language): the standard language for adding content to a website. It allows us to insert text, images, and other things into our site. In short, HTML defines the content of every webpage on the internet.

CSS (Cascading Style Sheets): this language allows us to set the visual design of a website. It determines the style and presentation of a webpage, including colors, layouts, and fonts.

JavaScript: a dynamic programming language. It allows us to make the content and the style interactive and provides a dynamic interface between the client-side script and the user.

Note that of these three, only JavaScript is a programming language; HTML is a markup language and CSS is a style sheet language. Together they allow us to create and manipulate every aspect of the design of a webpage.

Let’s illustrate these concepts with an example. When we visit the Politifact page, we see the following:

Screenshot from Politifact website

If we disable JavaScript, we can no longer use this pop-up; as you can see, the video pop-up window no longer appears:

Screenshot from Politifact website

If we delete the CSS from the webpage, after finding it with ctrl+F in the Inspect window, we see something like this:

Screenshot from Politifact website

So, at this point, I am going to ask you a question.

If you want to extract the content of a webpage via web scraping, where do you need to look?

I hope you are now clear about what kind of source code we need to scrape. If you are thinking of HTML, you are absolutely right.

So, the last step before performing web scraping is to understand a bit of the HTML language.


HTML

HTML is a “hypertext markup language” that defines the content of a webpage and consists of elements and attributes. To scrape data, you should be familiar with inspecting those elements.

An element could be a heading, paragraph, division, anchor tag, and so on.

An attribute is extra information attached to an element, such as the class of a division or the href (link target) of an anchor tag.

Elements are written with an opening tag <tag> and a closing tag </tag>, e.g.,

<p>This is paragraph.</p>

<a href="https://www.politifact.com" class="link">This is an anchor tag with two attributes.</a>

<h1><b>This is heading one in bold letters</b></h1>
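To make the difference between elements and attributes concrete, here is a minimal sketch that lists both for a small piece of HTML. It uses Python's built-in html.parser module rather than BeautifulSoup, and the snippet and the TagLister class are made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect (tag, attributes) pairs for every opening tag encountered."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, dict(attrs)))


# Hypothetical snippet, not taken from any real page
snippet = '<div class="story"><a href="/article-1" title="Example">Read more</a></div>'

parser = TagLister()
parser.feed(snippet)

for tag, attributes in parser.seen:
    print(tag, attributes)
```

Each printed line shows one element name followed by its attributes, which is the same information you read off the Inspect window when choosing what to scrape.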

Web-scraping using BeautifulSoup in PYTHON

Enough talk, show me the code.

Step-1: Installing Packages

We will begin by installing the necessary packages:

1. beautifulsoup4

To install it, type the following command into your Python distribution (for example, a Jupyter notebook cell).

! pip install beautifulsoup4

BeautifulSoup, from the bs4 package, is a library used to parse HTML and XML documents in Python in an easy and convenient way, and to access their elements by identifying them with their tags and attributes.

It is easy to use yet powerful: a few lines of code are often enough to extract data from a page.

2. requests

To install it, use the following command in your notebook, or use the same command without the exclamation mark in a command shell.

! pip install requests

To provide BeautifulSoup with the HTML code of a page, we need the requests module.

3. urllib

urllib is part of the Python standard library, so it does not need to be installed with pip. It is the URL handling module for Python and is used to fetch URLs (Uniform Resource Locators).

Here, though, it is imported alongside two other standard-library modules that we actually use:

time: lets us call the sleep() function to delay or suspend execution for a given number of seconds.

sys: used here to get exception info such as the type of error, the error object, and info about the error.

Step-2: Importing Libraries

Now we will import all the required libraries:

1.

BeautifulSoup

To import it, use the following command in your IDE:

from bs4 import BeautifulSoup

This library gives us the HTML structure of any page we want to work with and provides functions to access specific elements and extract relevant info.

2.

urllib

To import it, type the following command:

import urllib.request,sys,time

urllib.request: defines functions and classes that help in opening URLs.

sys: its functions help us retrieve exception info. (It is a standalone standard-library module, not part of urllib.)

time: a standard-library module, also separate from urllib, that provides several useful functions for time-related tasks. One of the most popular among them is sleep().

3.

requests

To import it, type import followed by the library name.

import requests

This module allows us to send HTTP requests to a web server using Python. (HTTP messages consist of requests from client to server and responses from server to client.)

4.

pandas

import pandas as pd


pandas is a high-level data-manipulation tool that we need to view our structured scraped data.

We will use this library to make a DataFrame (the key data structure of the library). DataFrames allow us to store and manipulate tabular data in rows of observations and columns of variables.

Putting all the imports together:

import urllib.request,sys,time
from bs4 import BeautifulSoup
import requests
import pandas as pd

Step-3: Making Simple requests

With the requests module, we can get the HTML content and store it in the page variable.

Make a simple GET request (just fetching a page):

# URL of the page that we want to scrape.
# page_no is the integer page number supplied by the outer pagination loop;
# str() converts it so it can be concatenated to the URL.
url = 'https://www.politifact.com/factchecks/list/?page=' + str(page_no)
# Fetch the URL. This is a suspicious command that might blow up.
page = requests.get(url)


Since requests.get(url) is a suspicious command and might throw an exception, we will call it in a try-except block:

try:
    # this might throw an exception if something goes wrong.
    page = requests.get(url)
# this describes what to do if an exception is thrown
except Exception as e:
    # get the exception information
    error_type, error_obj, error_info = sys.exc_info()
    # print the link that caused the problem
    print('ERROR FOR LINK:', url)
    # print error info and the line that threw the exception
    print(error_type, 'Line:', error_info.tb_lineno)
    # skip to the next page; continue is only valid inside the pagination loop
    continue

We will also use an outer for loop for pagination purposes.
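As a rough sketch of how the outer pagination loop and the try-except block fit together, the function below walks a range of page numbers and skips any page that fails. The names fetch_pages and fake_fetch are hypothetical; in a real run you would pass requests.get as the fetch argument, while the stand-in fetcher here keeps the example runnable without a network connection.

```python
import time


def fetch_pages(fetch, base_url, first_page, last_page, delay_seconds=2.0):
    """Call fetch(url) for each page number and collect the successful results.

    fetch is any callable that takes a URL and returns a response-like
    object (for example, requests.get). Pages that raise are skipped.
    """
    results = {}
    for page_no in range(first_page, last_page + 1):
        url = base_url + str(page_no)
        try:
            results[page_no] = fetch(url)
        except Exception as error:  # broad on purpose: log and move on
            print('ERROR FOR LINK:', url, type(error).__name__)
        finally:
            # Pause between requests so we do not hammer the server
            time.sleep(delay_seconds)
    return results


# Stand-in fetcher for demonstration only; it fails on page 2 on purpose.
def fake_fetch(url):
    if url.endswith('=2'):
        raise ConnectionError('simulated network failure')
    return '<html>' + url + '</html>'


pages = fetch_pages(fake_fetch, 'https://www.politifact.com/factchecks/list/?page=', 1, 3, delay_seconds=0)
print(sorted(pages))
```

Passing the fetcher in as an argument is a design choice for this sketch only: it lets you test the loop logic offline before pointing it at a live site.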

Step-4: Inspecting the Response Object

I. See what response code the server sent back (useful for detecting 4XX or 5XX errors).

page.status_code

Output: 200

The HTTP 200 OK success status response code indicates that the request has succeeded.

II. Access the full response as text (get the HTML of the page as one big string).

page.text

It returns the HTML content of the response object as a Unicode string.

Alternative:

page.content

This, by contrast, returns the content of the response in bytes.

III. Look for a specific substring of text within the response.

if "Politifact" in page.text:
    print("Yes, scrape it")

IV. Check the response’s Content-Type (see if you got back HTML, JSON, XML, etc.).

print(page.headers.get("content-type", "unknown"))
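The status-code and Content-Type checks from this step can be combined into one small helper. This is only a sketch: is_usable_html is a made-up name, and it takes plain values so it can be tried without making a request.

```python
def is_usable_html(status_code, content_type):
    """Return True when a response looks like a successful HTML page."""
    # 2XX codes mean success; 4XX and 5XX indicate client or server errors
    succeeded = 200 <= status_code < 300
    # Content-Type may carry a charset suffix, e.g. "text/html; charset=utf-8"
    media_type = content_type.split(';')[0].strip().lower()
    return succeeded and media_type == 'text/html'


print(is_usable_html(200, 'text/html; charset=utf-8'))
print(is_usable_html(404, 'text/html'))
print(is_usable_html(200, 'application/json'))
```

With a real response you would call it as is_usable_html(page.status_code, page.headers.get("content-type", "unknown")) before handing page.text to the parser.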

Step-5: Delaying request time

Next, with the time module, we can call the sleep() function with a value of 2 seconds. Here it delays sending requests to the web server by 2 seconds.

time.sleep(2)


The sleep() function suspends execution of the current thread for a given number of seconds.

Step 6: Extracting Content from HTML

Now that you’ve made your HTTP request and gotten some HTML content, it’s time to parse it so that you can extract the values you’re looking for.


A) Using Regular Expressions

Using regular expressions to look up HTML content is strongly discouraged.

However, regular expressions are still useful for finding specific string patterns like prices, email addresses, or phone numbers.

Run a regular expression on the response text to look for specific string patterns. Note the backslash before $: an unescaped $ matches the end of the string, not a dollar sign.

import re # put this at the top of the file
print(re.findall(r'\$[0-9,.]+', page.text))
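Here is a small self-contained sketch of that kind of pattern search. The sample text stands in for page.text and is invented, and the email pattern is deliberately simple rather than a complete validator.

```python
import re

# Hypothetical text standing in for page.text
sample_text = 'Tickets cost $1,250.50 each. Contact press@example.com or sales@example.org.'

# \$ matches a literal dollar sign; an unescaped $ would anchor the end of the string
prices = re.findall(r'\$[0-9][0-9,]*(?:\.[0-9]+)?', sample_text)

# A deliberately simple email pattern, good enough for a quick scan
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', sample_text)

print(prices)
print(emails)
```

The groups use the non-capturing form (?:...) so that re.findall returns the whole match instead of only the grouped part.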


B) Using BeautifulSoup’s soup object

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

soup = BeautifulSoup(page.text, "html.parser")

The command below looks for all <li> tags with the class attribute 'o-listicle__item':

links = soup.find_all('li', attrs={'class': 'o-listicle__item'})


INSPECTING WEBPAGE

To understand the code above, you need to inspect the webpage, so please follow along:

1) Go to the URL listed above.

2) Press ctrl+shift+I to inspect it.

3) This is how your Inspect window will look:

Press ctrl+shift+C to select an element on the page to inspect, or click the leftmost arrow in the header of the Inspect window.

4) To find the specific element and attribute above in the Inspect window:

First, try hovering over every section of the webpage and watch the changes in your Inspect window. You will quickly grasp how webpages work, which element is which, and what each attribute contributes to the webpage.

Once you are done with the above step, I assume you can understand how the <li> element above and its attribute work.

Since I needed the news section of a particular article, I went to that article section by selecting the inspect-element option in the Inspect window. It highlights that article section on the webpage and its HTML source in the Inspect window. Voila!✨

Were you able to locate the same tag on your machine?

If yes, you are all set to understand every HTML tag I have used in my code.

Continuing with my code:

print(len(links))

This command tells you how many news articles there are on a given page.

It helps you judge how many pages you need to paginate through in your loop to extract a large amount of data.

Step-7: Finding elements and attributes

Look for all anchor tags on the page (useful if you’re building a crawler and need to find the next pages to visit). The result is stored under a separate name so that it does not overwrite links from Step 6:

anchors = soup.find_all("a")

The following finds a division tag under the <li> tag, where the div tag has the specified class attribute value. Here 'j' is the loop variable iterating over the result set 'links', which holds all news articles listed on a given page.

Statement = j.find("div", attrs={'class': 'm-statement__quote'})

text.strip() returns the text contained within this tag and strips extra whitespace, including '\n' and '\t', from both ends of the string.

Statement = j.find("div", attrs={'class': 'm-statement__quote'}).text.strip()


Bravo! We have scraped the first attribute of our dataset, i.e., Statement.

In the same division section, the next line looks for the anchor tag and returns the value of its hypertext link. Again, strip() is used to tidy the values so that our CSV file looks good.

Link = j.find("div", attrs={'class': 'm-statement__quote'}).find('a')['href'].strip()

To get the Date attribute, you need to inspect the webpage first, because the date is embedded in a longer string. Calling text without any indexing returns that whole string.

We don’t need any text other than the date, so I use indexing. You can also clean the attribute later using a regular expression. 'footer' is the element that contains the required text.

Date = j.find('div', attrs={'class': 'm-statement__body'}).find('footer').text[-14:-1].strip()
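Fixed slicing such as [-14:-1] depends on the exact length of the date text, so month names of different lengths can shift what it returns. As the text suggests, a regular expression is one way to clean this up. The sketch below is an alternative to the slicing, not the method used in this article; extract_date is a hypothetical helper and the footer wording is assumed.

```python
import re
from datetime import datetime

# Matches dates written like "October 7, 2020"
DATE_PATTERN = re.compile(r'[A-Z][a-z]+ \d{1,2}, \d{4}')


def extract_date(footer_text):
    """Return the first 'Month D, YYYY' date in footer_text, or None."""
    match = DATE_PATTERN.search(footer_text)
    if match is None:
        return None
    try:
        # strptime rejects words that are not real month names
        return datetime.strptime(match.group(), '%B %d, %Y').date()
    except ValueError:
        return None


# Hypothetical footer text; the real wording on the site may differ
print(extract_date('By Jane Doe • October 7, 2020'))
print(extract_date('no date here'))
```

Inside the scraping loop you would pass it the full footer text, i.e. the .text of the 'footer' element without any slicing.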

Here I have done everything the same as before, except for get(), which extracts the value of the attribute passed to it (i.e., title).

Source = j.find('div', attrs={'class': 'm-statement__author'}).find('a').get('title').strip()

For my project, I needed a dataset that had not been altered, and I also needed to know in advance, for thousands of articles, which category each one belongs to for my training data. No one can do that manually. On this website, the articles already come with labels attached, but the label is not retrievable as text because it is contained in an image. For this kind of task, you can use get() to retrieve the text effectively. Here, I am passing 'alt' to get(); the alt attribute contains our Label text.

Label = j.find('div', attrs={'class': 'm-statement__content'}).find('img', attrs={'class': 'c-image__original'}).get('alt').strip()

In the lines of code below, I have put all the concepts together to fetch the details for five different attributes of my dataset.

for j in links:
    Statement = j.find("div", attrs={'class': 'm-statement__quote'}).text.strip()
    Link = j.find("div", attrs={'class': 'm-statement__quote'}).find('a')['href'].strip()
    Date = j.find('div', attrs={'class': 'm-statement__body'}).find('footer').text[-14:-1].strip()
    Source = j.find('div', attrs={'class': 'm-statement__author'}).find('a').get('title').strip()
    Label = j.find('div', attrs={'class': 'm-statement__content'}).find('img', attrs={'class': 'c-image__original'}).get('alt').strip()
    frame.append([Statement, Link, Date, Source, Label])
upperframe.extend(frame)
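One thing to watch for in chained calls like these: find() returns None when a tag is missing, so a single article with an unusual layout raises an AttributeError and stops the loop. The sketch below shows one way to guard against that. The HTML is made up to mimic the class names used in this article, and the guards are an addition for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above; the second
# item has no link on purpose, to show how missing tags are handled.
html = """
<ul>
  <li class="o-listicle__item">
    <div class="m-statement__quote"><a href="/factchecks/example-1/"> First claim </a></div>
  </li>
  <li class="o-listicle__item">
    <div class="m-statement__quote">Second claim</div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for item in soup.find_all('li', attrs={'class': 'o-listicle__item'}):
    quote = item.find('div', attrs={'class': 'm-statement__quote'})
    if quote is None:
        continue  # nothing to extract from this item
    anchor = quote.find('a')
    # find() returns None when the tag is absent, so guard before indexing
    link = anchor['href'].strip() if anchor is not None and anchor.has_attr('href') else ''
    rows.append([quote.get_text(strip=True), link])

print(rows)
```

An empty string is used here as the placeholder for a missing link; whether to keep or drop such rows depends on what your dataset needs.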

Step-8: Making Dataset

For each article, append its attribute values to the list 'frame', which starts out empty for each page:

frame.append([Statement,Link,Date,Source,Label])

Then, once per page, extend the list 'upperframe' (which starts out empty before the first page) with 'frame':

upperframe.extend(frame)

Step-9: Visualising Dataset

If you want to view your data in Jupyter, you can use a pandas DataFrame to do so.

data = pd.DataFrame(upperframe, columns=['Statement','Link','Date','Source','Label'])
data.head()

Step-10: Making CSV file & saving it to your machine


A) Opening & writing to file

The commands below will help you write a CSV file and save it to your machine in the same directory where your Python file is saved.

filename = "NEWS.csv"
f = open(filename, "w")
headers = "Statement,Link,Date,Source,Label\n"
f.write(headers)
….
f.write(Statement.replace(",","^")+","+Link+","+Date.replace(",","^")+","+Source.replace(",","^")+","+Label.replace(",","^")+"\n")

This line writes each attribute to the file, replacing any ',' with '^'.

f.write(Statement.replace(",","^")+","+Link+","+Date.replace(",","^")+","+Source.replace(",","^")+","+Label.replace(",","^")+"\n")

So, when you run this file from the command shell, it will create a CSV file in your .py file’s directory.
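The '^' replacement is needed because a comma inside a field would otherwise be read as a column separator. An alternative, not the method used in this article, is Python's built-in csv module, which quotes such fields automatically. The sketch below writes to an in-memory buffer with an invented row so it can run anywhere.

```python
import csv
import io

# Hypothetical row in the same column order as the dataset above
rows = [
    ['Says "taxes went up", twice', '/factchecks/example-1/', 'October 7, 2020', 'Jane Doe', 'false'],
]

buffer = io.StringIO()  # stands in for open('NEWS.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(buffer)
writer.writerow(['Statement', 'Link', 'Date', 'Source', 'Label'])
writer.writerows(rows)

# Reading the text back recovers the original fields, commas and quotes included
buffer.seek(0)
recovered = list(csv.reader(buffer))
print(recovered[1][0])
print(recovered[1][2])
```

Because the commas survive the round trip, no find-and-replace step is needed after opening the file.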

On opening it, you might see odd-looking data if you didn’t use strip() while scraping, so try it without strip() to see the difference. The data will also look odd until you replace '^' with ','.

Replace it using these simple steps:

Open your CSV file in Excel.

Press ctrl+H (a pop-up window will appear asking for Find what and Replace with).

Enter '^' in the 'Find what' field and ',' in the 'Replace with' field.

Press Replace All.

Click Close, and woohoo! Your dataset is now in perfect form. Don’t forget to close your file with the following command once both for loops are done:

f.close()

Running the same code again will overwrite the CSV file created by the file-writing method, and it might throw an error if that file is still open in another program such as Excel.


B) Converting the DataFrame into a CSV file using to_csv()

So, instead of this lengthy method, you can opt for another one: to_csv() converts the DataFrame into a CSV file and takes the destination path as an argument.

# Backslashes are doubled because a single backslash starts an escape sequence in a Python string
path = 'C:\\Users\\Kajal\\Desktop\\KAJAL\\Project\\Datasets\\'
data.to_csv(path + 'NEWS.csv')

To avoid ambiguity and make your code portable, you can use this:

import os
data.to_csv(os.path.join(path, 'NEWS.csv'))

This appends your CSV file name to your destination path correctly.

SUGGESTION & CONCLUSION

That said, I suggest using the first method: open the file, write to it, and then close it. I know it is a bit lengthy and tedious to implement, but in my experience it did not give me ambiguous data the way the to_csv method often did.

See in the above image how it extracts ambiguous data for the Statement attribute.

So, instead of spending hours cleaning your data manually, I suggest writing the few extra lines of code from the first method.

Now, you are done with it.✌️



IMPORTANT NOTE:



If you copy-paste my source code to scrape a different website and run it, it may well throw an error. In fact, it will almost certainly throw an error, because each webpage’s layout is different, and you need to adapt the code accordingly.

Full Code


The Dataset:

This article is the first part of the web-scraping series. If you come from a non-technical background, read the second part of this series here.

I hope you found this article useful and liked it. Please feel free to share your thoughts and hit me up with any queries you might have. You can reach me via the following:

Subscribe to my YouTube channel for video content coming soon

Follow me on Medium

Connect and reach me on LinkedIn
