Football News Aggregator through Web Scraping

I follow goal.com, givemesport.com and sportskeeda.com for football news. Although all of European football interests me, Spanish football is where my allegiance rests. Things are great as they stand, but it would be even better if there were a way to gather the links to La Liga news articles from these websites and order them by recency without actually visiting the sites. That could save me some effort and, frankly, I have some time to kill these holidays.

But before we go further, I want to make it clear that the intention is not to republish articles from these websites. All I am trying to do is collect the links to articles about Spanish football from these sites for easy navigation. To acknowledge that these articles belong to their respective sites, I have made sure to append the site name to the article title wherever it was missing.

I have tried to keep the script as simple as possible. So, here is how it goes:

  • Import the necessary libraries
  • Provide the url to the home page for Spanish soccer news
  • Hit the url and fetch the response
  • Extract the HTML and identify the tags that hold the links to La Liga news articles
  • Collect all the links through BeautifulSoup
  • Crawl each of the collected links
  • Extract title and published date through BeautifulSoup after inspecting HTML tags
  • As long as the designs for these websites don’t change, our news aggregator will work just fine
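The link-collection steps above can be sketched roughly as follows. This is a minimal illustration, not the script behind this post: it uses Python's built-in `html.parser` in place of BeautifulSoup so that it runs without installing anything, it skips the fetch step (with the `requests` library that would be `requests.get(url).text`), and the sample markup, the `example.com` address and the `/football/` keyword are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags whose path contains a keyword."""

    def __init__(self, base_url, keyword):
        super().__init__()
        self.base_url = base_url
        self.keyword = keyword
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.keyword in href:
            # urljoin resolves relative paths against the base URL.
            full = urljoin(self.base_url, href)
            if full not in self.links:  # keep first occurrence, preserve order
                self.links.append(full)


def collect_links(html, base_url, keyword):
    """Return the de-duplicated article links found in an HTML string."""
    parser = LinkCollector(base_url, keyword)
    parser.feed(html)
    return parser.links


# Hypothetical markup; the real pages use their own tags and classes.
SAMPLE_HTML = """
<div class="news">
  <a href="/football/la-liga-story-one">Story one</a>
  <a href="/football/la-liga-story-one">Story one (duplicate)</a>
  <a href="/cricket/other-story">Other sport</a>
  <a href="http://www.example.com/football/la-liga-story-two">Story two</a>
</div>
"""

links = collect_links(SAMPLE_HTML, "http://www.example.com/", "/football/")
print(links)
```

Each collected link would then be fetched and parsed in the same way to pull out the title and the published date. Which tags hold those values differs per site, so they have to be found by inspecting the HTML.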

scraping-code-snippet

The image above shows only a snippet of the code. The procedure to crawl goal.com and sportskeeda.com is more or less the same, just with different tags.

Once I had appended the details of all the articles to a dataframe and sorted it by recency, I got the following output:
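As a rough sketch of the sorting step, here is the same idea with a plain list of dictionaries standing in for the dataframe (with pandas, the equivalent call would be `df.sort_values("published", ascending=False)`). The records and field names are hypothetical.

```python
from datetime import date

# Hypothetical records standing in for the scraped rows.
articles = [
    {"title": "Story A", "published": date(2017, 1, 14), "url": "http://www.example.com/a"},
    {"title": "Story B", "published": date(2017, 1, 16), "url": "http://www.example.com/b"},
    {"title": "Story C", "published": date(2017, 1, 15), "url": "http://www.example.com/c"},
    {"title": "Story D", "published": date(2017, 1, 16), "url": "http://www.example.com/d"},
]


def sort_by_recency(records):
    """Return records newest first; ties keep their original order."""
    return sorted(records, key=lambda r: r["published"], reverse=True)


for rec in sort_by_recency(articles):
    print(rec["published"], rec["title"])
```

Parsing the published dates into real date objects before sorting matters here: sorting raw strings only works when every site happens to use the same year-month-day format.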

scraping-output

I have also written the output to a text file. It serves as a mini magazine with links to the latest updates on Spanish football from my three favorite websites. Sample results are shown at the end of this post.
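For illustration, the text file could be produced along these lines. This is a sketch under assumptions: the function names, the record fields and the `magazine.txt` file name are invented for the example, and the layout simply mirrors the sample results at the end of this post.

```python
def with_site_name(title, site):
    """Append the site name unless the title already mentions it."""
    if site.lower() in title.lower():
        return title
    return f"{title} | {site}"


def render_magazine(records):
    """Format records as numbered blocks: header, title, date, link."""
    blocks = []
    for i, rec in enumerate(records, start=1):
        title = with_site_name(rec["title"], rec["site"])
        blocks.append(
            f"##Article {i}##\n\n{title}\n\n{rec['published']}\n\n{rec['url']}\n"
        )
    return "\n".join(blocks)


def write_magazine(records, path="magazine.txt"):
    """Write the rendered blocks to a UTF-8 text file (hypothetical file name)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_magazine(records))


# Hypothetical records, already sorted by recency.
sample = [
    {"title": "Story A", "site": "GiveMeSport",
     "published": "2017-01-16", "url": "http://www.example.com/a"},
    {"title": "Story B | Sportskeeda.com", "site": "Sportskeeda.com",
     "published": "2017-01-16", "url": "http://www.example.com/b"},
]
print(render_magazine(sample))
```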

This was a short post on web scraping. Inside18yard will be back with more new and interesting stuff!

———————————————————————————————————————–

##Article 1##

The latest Antoine Griezmann to Manchester United update  | GiveMeSport

2017-01-16

http://www.givemesport.com//964054-the-latest-antoine-griezmann-to-manchester-united-update

 

##Article 2##

Heaviest defeats faced by 5 legendary managers in club football | Sportskeeda.com

2017-01-16

http://www.sportskeeda.com/football/heaviest-defeats-faced-5-legendary-managers-club-football/

 

##Article 3##

RUMOURS: Man City ready to pay £100m for Messi – Goal.com

2017-01-16

http://www.goal.com/en-us/news/88/spain/2017/01/16/31618542/rumours-man-city-ready-to-pay-100m-for-messi

 

 

 

Football Players’ Popularity Analysis through Twitter Streaming API

Inside18yard brings its readers a real-time analysis of football players’ popularity, built on the Twitter Streaming API with PySpark, Python, NLTK and Plotly.

Rivalries in sports inspire intense emotions. Whether it’s Federer vs Nadal, Sachin vs Lara or Messi vs Ronaldo, everyone has an opinion. Among these rivalries, Messi vs Ronaldo is a special one because both are still at the peak of their abilities and have already shattered record after record. Between them they have completely dominated football’s annual Ballon d’Or best player award for the last 7 years. Football is the most popular sport in the world, played in over 190 countries, each with its own national team and some with multiple divisions of domestic leagues for men as well as women. Considering this, it is an amazing feat that these two players have remained at the top for so long and still show no signs of decline.

TWEET COUNT

Lionel Messi and Cristiano Ronaldo, already legends, inspire fans around the world, so it is interesting to see who has the stronger fan base. The following is a dynamic bar plot of the number of tweets that mention either Messi or Ronaldo. I have ignored tweets that mention both, because they do not count towards the difference in popularity.
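The counting rule can be sketched in plain Python as below. This is only an illustration of the "one player but not both" filter; the actual pipeline ran on a live stream with PySpark, and the sample tweets here are invented.

```python
import re
from collections import Counter

MESSI = re.compile(r"\bmessi\b", re.IGNORECASE)
RONALDO = re.compile(r"\bronaldo\b", re.IGNORECASE)


def classify(tweet):
    """Return 'Messi' or 'Ronaldo' when exactly one is mentioned, else None."""
    has_messi = bool(MESSI.search(tweet))
    has_ronaldo = bool(RONALDO.search(tweet))
    if has_messi == has_ronaldo:  # both mentioned, or neither
        return None
    return "Messi" if has_messi else "Ronaldo"


def count_mentions(tweets):
    """Tally tweets per player, skipping tweets that mention both or neither."""
    counts = Counter()
    for tweet in tweets:
        label = classify(tweet)
        if label:
            counts[label] += 1
    return counts


# Invented sample tweets.
sample = [
    "What a goal by #Messi",
    "Ronaldo with another hat-trick",
    "Messi or Ronaldo, who is better?",
    "Messi's free kick was unreal",
    "Nothing to do with football",
]
print(count_mentions(sample))
```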

Messi emerges as the winner in the bar chart of tweet counts from 5 minutes of the Twitter stream. I regenerated this plot several times on a day when neither of them was playing, and the results were more or less the same: Messi comes out ahead of Ronaldo, though not by a huge margin.

SENTIMENT ANALYSIS

But the above graph does not show the complete picture: tweets could be critical of the players rather than positive, which would, as a matter of fact, hurt their popularity. To account for this, I extracted the sentiment from the tweets and plotted a dynamic time series of the cumulative sentiment for the two players. The graph covers 5 minutes of the Twitter stream, and again I have ignored tweets that mention both players.
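To make "cumulative sentiment" concrete, here is a toy sketch. The six-word lexicon is purely illustrative and stands in for whatever NLTK-based scoring was actually used; only the running-total logic is the point.

```python
from itertools import accumulate

# Toy lexicon, purely illustrative; a real analyzer covers far more words.
LEXICON = {
    "great": 1, "brilliant": 1, "love": 1,
    "poor": -1, "awful": -1, "overrated": -1,
}


def score(tweet):
    """Sum lexicon values over the words of a tweet."""
    words = tweet.lower().split()
    return sum(LEXICON.get(word.strip(".,!?#"), 0) for word in words)


def cumulative_sentiment(tweets):
    """Running total of tweet scores, in arrival order."""
    return list(accumulate(score(tweet) for tweet in tweets))


# Invented tweets about one player, in the order they arrived.
sample = [
    "Messi was brilliant tonight!",
    "Poor game from Messi",
    "Love watching Messi",
]
print(cumulative_sentiment(sample))
```

Plotting one such running total per player against time gives the kind of time series described above.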

Messi’s cumulative sentiment comes out higher than Ronaldo’s at the end of 5 minutes. For Real Madrid fans like me it is hard to digest, but the data speaks the truth: in this sample, Messi is indeed more popular than Ronaldo.

Nevertheless, for Ronaldo if it helps, I love you both equally!! 😀

OVERALL BUZZ ON TWITTER

Finally, I collected the Twitter stream for 2 hours to get an idea of whether these two are really the most talked-about footballers. Since I am using a PySpark Accumulator for each player, I had to explicitly list the player names I wanted to track. The result therefore gives only a general direction in which to look for our popularity winners.
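The reason the names have to be listed up front is that each player gets a dedicated counter. Below is a plain-Python stand-in for that idea. The `Accumulator` class here is a hypothetical miniature, not the PySpark class (in PySpark the counters would come from `SparkContext.accumulator`), and the list of tracked names is only an example.

```python
class Accumulator:
    """Add-only counter; a plain-Python stand-in, not the PySpark class."""

    def __init__(self, value=0):
        self.value = value

    def add(self, amount):
        self.value += amount


def track(tweets, names):
    """Count tweets mentioning each tracked name (case-insensitive)."""
    accumulators = {name: Accumulator() for name in names}
    for tweet in tweets:
        text = tweet.lower()
        for name, acc in accumulators.items():
            if name.lower() in text:
                acc.add(1)
    return {name: acc.value for name, acc in accumulators.items()}


# Example list of tracked names and invented tweets.
TRACKED = ["Messi", "Ronaldo", "Neymar"]
sample = [
    "Neymar and Messi link up again",
    "Ronaldo scores",
    "What a pass by Messi",
    "Great save by the keeper",
]
buzz = track(sample, TRACKED)
print(buzz)
```

Any player missing from the tracked list simply never gets counted, which is why the result can only point in a general direction.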

As it turns out, Messi still has the most tweets, while second place goes to Neymar, ahead of Ronaldo. It can always be argued, though, that Portugal has a much smaller population than Argentina and Brazil, which certainly has an effect.

player-popularity

Hope you found this article interesting. Inside18yard will be back with more posts. Keep following us and please Like / Share if it was fun!

How do top European football leagues compare against each other?

Inside18yard is back and this fall we are “going data!” in our articles. This blog in particular focuses on a comparison of top European leagues using Descriptive Statistics.

The 2015-16 season of European football was an exciting one, and even more so for the English Premier League. ‘He just did a Leicester’ has become the catchphrase after we saw what is certainly the greatest underdog story of all time in a team sport, and perhaps in all of sporting history. In the previous season Leicester City were at the bottom of the table as late as April 2015, with just 9 matches to go and facing a very real threat of relegation, yet somehow managed to keep their place in the Premier League. And it would be unfair to call what happened afterwards anything less than a fairy tale.

Leicester City FC in the next season, against the 5000:1 odds given to them by bookmakers, went on to win the title ahead of the likes of Spurs, Arsenal and the mighty Manchester clubs. To wrap your head around the achievement, consider that the last time any club outside the Big 5 won the Premier League was in 1994-95, while there was still some economic parity between the teams.

More impressive still is their away goals record. Let us look at how the numbers tally against the winners of La Liga, Bundesliga and Serie A.

goals

The bar plots above show that Leicester were pretty consistent. Irrespective of whether they were playing at home at the King Power Stadium or away, they averaged around 1.75 goals per game.
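A home and away goals-per-game average of this kind can be computed as sketched below. The column names `HomeTeam`, `AwayTeam`, `FTHG` and `FTAG` are assumed to follow the football-data.co.uk files, and the rows are made up purely to show the arithmetic; they are not real results.

```python
import csv
import io

# Made-up rows in the assumed football-data.co.uk column layout (not real results).
SAMPLE_CSV = """HomeTeam,AwayTeam,FTHG,FTAG
Leicester,Team A,2,1
Team B,Leicester,0,2
Leicester,Team C,1,1
Team A,Leicester,1,2
"""


def goals_per_game(rows, team):
    """Average goals scored by a team at home and away."""
    home = [int(r["FTHG"]) for r in rows if r["HomeTeam"] == team]
    away = [int(r["FTAG"]) for r in rows if r["AwayTeam"] == team]
    return {
        "home": sum(home) / len(home) if home else 0.0,
        "away": sum(away) / len(away) if away else 0.0,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(goals_per_game(rows, "Leicester"))
```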

fouls

They have also fared well on the number of fouls committed. But this is only half the picture, as playing styles vary between the leagues and referees tend to give more fouls in Spain.

But now we move away from focusing on just the winners and investigate how the four leagues compare with each other overall. More interestingly, we add a temporal dimension to the analysis by looking at the trend of key metrics like Fouls, Yellow/Red Cards, Shots on Target and Goals over the last decade for the top 4 European leagues.
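A trend of this kind boils down to totalling a metric per league and per season. Here is a minimal sketch for fouls, using made-up match rows; `HF` and `AF` are assumed to be the home and away foul columns in the source data.

```python
from collections import defaultdict

# Made-up match rows: (league, season, home fouls HF, away fouls AF).
matches = [
    ("Serie A", "2005-06", 20, 22),
    ("Serie A", "2005-06", 18, 19),
    ("Serie A", "2006-07", 17, 18),
    ("EPL", "2005-06", 12, 13),
    ("EPL", "2006-07", 11, 12),
]


def fouls_by_league_season(rows):
    """Total fouls (home plus away) for every (league, season) pair."""
    totals = defaultdict(int)
    for league, season, home_fouls, away_fouls in rows:
        totals[(league, season)] += home_fouls + away_fouls
    return dict(totals)


def trend(totals, league):
    """Season-ordered list of (season, total) for one league."""
    return sorted((s, t) for (lg, s), t in totals.items() if lg == league)


totals = fouls_by_league_season(matches)
print(trend(totals, "Serie A"))
```

One caveat when comparing totals: the Bundesliga has 18 teams and therefore plays fewer matches per season than the 20-team leagues, so a per-match average may be the fairer comparison.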

fouls_trend

There is a visible difference in the fouls committed between the four leagues. Teams in Serie A probably tend to commit more fouls because of the defensive playing style associated with Italian sides. Another interesting observation is that the number of fouls has been declining steadily over the years. Perhaps teams in the mid-2000s had a more violent playing style, or referees gave out more fouls back then. Who knows?

We cannot comment on the reason from this data. Nevertheless, it is an interesting observation.

The trends of total Yellow Cards and Red Cards given by referees also reveal some insights.

cards

We see that, despite more fouls being committed in Serie A than in La Liga, referees in Italy have given out far fewer Yellow Cards and Red Cards than their counterparts in Spain. A similar pattern can be observed for the EPL and Bundesliga.
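One way to put fouls and cards on the same footing is a cards-per-foul rate. This metric is an addition for illustration rather than something plotted in this post, and the league names and totals below are invented to show the arithmetic.

```python
def cards_per_foul(fouls, yellows, reds):
    """Cards shown (yellow plus red) per foul committed."""
    if fouls == 0:
        return 0.0
    return (yellows + reds) / fouls


# Made-up season totals, only to show the arithmetic.
season_totals = {
    "League X": {"fouls": 10000, "yellows": 1500, "reds": 100},
    "League Y": {"fouls": 8000, "yellows": 1600, "reds": 80},
}

rates = {
    league: round(cards_per_foul(**totals), 3)
    for league, totals in season_totals.items()
}
print(rates)
```

In this made-up case League Y commits fewer fouls but is carded more often per foul, which is the kind of contrast described above.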

Secondly, the more obvious observation is that there is a visible difference in the number of Red Cards and Yellow Cards given out by referees between these countries. This does tell us something about the playing style in these countries: while more tussling is acceptable in England and Germany, Spanish and Italian referees tend to be stricter.

Another way to comment on the playing style in these countries is to look at the trend of total shots on target over the last decade.

shots_trend

Quite clearly, La Liga and Serie A teams are much more clinical in front of goal than EPL teams. The English game is faster and more intense, but players tend to misfire their shots.

Or perhaps, the second possibility is that EPL teams have better goalkeepers??

We hope we were able to present some interesting insights from the data on how the top European football leagues compare against each other. We will be back with more fun reads for the loyal readers of Inside18yard. Stay tuned!!

Data Source – http://www.football-data.co.uk/spainm.php