Datasheet
Languages:
Python
Technologies:
beautifulsoup4 (plus FTDHandler and StringStats, libraries I’ve built)
Repository:
https://github.com/rodrigogomesrc/CrawlerStats
About
The project consists of a Crawler module and a Stats module, as the name suggests. What does that mean? The program follows links from page to page and then uses those links to extract statistics about the pages they point to.
In more detail, the program I built goes through pages, captures their links and saves them. Those links are then accessed to look for more links. The program keeps doing this until it has collected the minimum number of links it was asked to obtain.
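The loop described above can be sketched roughly as follows. This is only an illustration, not the code from the repository: the names `crawl`, `extract_links`, `fetch` and `link_limit` are made up for this example, and it uses the standard library's `html.parser` in place of beautifulsoup4 so that it runs without extra dependencies.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return the absolute http(s) links found in a page."""
    parser = LinkParser()
    parser.feed(html)
    links = [urljoin(base_url, href) for href in parser.hrefs]
    return [link for link in links if link.startswith(("http://", "https://"))]


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, link_limit, fetch_page=fetch):
    """Follow links breadth-first until link_limit links are collected."""
    collected = [start_url]      # links in the order they were found
    known = {start_url}          # same links, for fast duplicate checks
    queue = deque([start_url])   # pages still to be visited

    while queue and len(collected) < link_limit:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that cannot be downloaded
        for link in extract_links(html, url):
            if link not in known:
                known.add(link)
                collected.append(link)
                queue.append(link)
                if len(collected) >= link_limit:
                    break
    return collected
```

The `fetch_page` parameter is there so the crawl can be tried on a small set of in-memory pages before pointing it at a real site.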
After obtaining all these links, another module of the project visits each page, saves its text to a file, and then uses that file to run the analysis and generate statistics.
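As a rough idea of what this step involves, here is a small sketch that pulls the visible text out of HTML and writes it to a file, one page per line. The function names `page_text` and `save_pages_text` are hypothetical, and the sketch uses the standard library's `html.parser` rather than beautifulsoup4 or FTDHandler, so it should not be read as the project's actual code.

```python
from html.parser import HTMLParser


class TextParser(HTMLParser):
    """Collects visible text, ignoring the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    """Return the text of a page as a single space-separated string."""
    parser = TextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def save_pages_text(pages_html, path):
    """Write the text of each page to the file at path, one page per line."""
    with open(path, "w", encoding="utf-8") as output:
        for html in pages_html:
            output.write(page_text(html) + "\n")
```

Keeping all the text in one file means the analysis step only has to read a single input, whatever the number of pages crawled.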
In this project, these statistics cover the number of words and characters and their respective frequencies, computed either with or without the so-called “stopwords”. In computing, stopwords are words that are filtered out before natural language processing. They are usually the most common words in a language, such as articles and prepositions, since they say little about the actual meaning and content of a text.
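To make the idea concrete, here is a minimal sketch of counting word and character frequencies with and without stopwords. It does not reproduce the StringStats API: the function names are invented for this example, and the `STOPWORDS` set is a tiny hand-picked sample rather than a real stopword list.

```python
import re
from collections import Counter

# Tiny illustrative sample; a real stopword list is much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}


def word_frequencies(text, remove_stopwords=False):
    """Count how often each word appears, optionally ignoring stopwords."""
    # Lowercase the text and keep only runs of letters as words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if remove_stopwords:
        words = [word for word in words if word not in STOPWORDS]
    return Counter(words)


def char_frequencies(text):
    """Count how often each non-whitespace character appears."""
    return Counter(char for char in text.lower() if not char.isspace())


if __name__ == "__main__":
    sample = "The cat and the hat"
    print(word_frequencies(sample).most_common(3))
    print(word_frequencies(sample, remove_stopwords=True).most_common(3))
    print(sum(char_frequencies(sample).values()), "characters")
```

Comparing the two word counts shows why stopwords are filtered: without filtering, the top of the ranking is dominated by words like "the", which say nothing about the subject of the text.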
Why did I build the project?
Before building this project, I built two libraries that it uses: FTDHandler and StringStats. The first manipulates text from files and the second provides functions that analyse those texts. I built both for Python practice and because analytics and statistics are of interest to me.
After I built these libraries, I came up with ideas for analysing more real-world data. So I built this project with the intention of testing it on Wikipedia. And that’s what I did.
Project future
As I’ve built this project to help me extract data from Wikipedia, I intend to use it with other data processing tools and also work with the Twitter API to extract data from there as well. Once I have all this data, I can learn how to process and analyse it in many different ways.
Current status
This project was made more or less as a sketch, with the objective of testing technologies. I’m not currently working on it, nor do I have plans to continue its development for the time being. But this can change in the future if I find a greater purpose for it.
About the project articles
The articles on this website (my blog) describe various aspects of the programming projects (or other types of projects, if necessary) I develop: development details and any other information I find necessary. They are more like reports than actual articles, and they are here to serve as my portfolio.
As these projects can be works in progress, so are these “reports”. An article is therefore not fixed: it is edited as the information about the project changes, so that it stays consistent with the project’s development.