eBay web scraping with Python and Beautiful Soup: demand research



Hello everyone, today I have another project for you: an eBay web scraper. Let's imagine that you want to start an eBay business selling some product that is in high demand, and to do it you have to choose a niche or a product to sell. So you might need some data about the current situation on the eBay marketplace. One way to collect the needed data is to research the goods or categories of goods manually, but that is a tedious and time-consuming task, and some Python programming skills can help you gather the needed data almost automatically. Today we will write an eBay web scraper that collects some data and writes it to a CSV file. Let's start. Open the ebay.com website; let's say I want to research the men's watches category. I type the search request "men's watches" into the search bar and we get the search results. Before I write a single line of code I have to explore the eBay website manually and determine what the scraper will do. Let's open a product page; I think I want to collect 1) the title, 2) the price, and 3) the quantity of items sold. So the scraper will do the following:

1. Make a request to the eBay website and get the HTML code of the index page, that is, the page with the list of products.
2. Collect all the links to the detail pages.
3. Parse the needed data: a title, a price, and so on.
4. Export the data to a CSV file.
Let's start writing our scraper. I have created the project folder and I am creating a new Python script named 'ebay_scrapper.py', and now I have to install some Python libraries. I am using 'pip3' for Python 3; if you are using the Windows operating system you have only one 'pip'. So I run 'pip3 install requests beautifulsoup4 lxml' (I have already installed them). Next I am creating an entry point for the scraper, an 'if main' block. The IF condition checks whether the file 'ebay_scrapper.py' was run directly from the console or not. If the file is run from the console, its __name__ attribute will be equal to '__main__', but if the file is imported into another script, its __name__ attribute will contain the name of the module, 'ebay_scrapper'. If this condition is True, the main() function will be called. So I am creating a new function main() with a 'pass' statement for now. The main() function will play the role of a hub that manages the calls of the other functions and collects the scraped data.
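A minimal sketch of that entry point, with main() left as a stub for now:

    def main():
        pass  # will later orchestrate the other scraping functions

    if __name__ == '__main__':
        main()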
I want to paste a 'TO DO' list here, and our next step is to get an eBay page. We need a function that will make requests to eBay. Let's define a new function and name it, say, get_page(); it will take a URL address as an argument. The get_page() function will make requests with the Requests library that we have installed, so I have to import it: 'import requests'. Inside the get_page() function I am creating a new variable 'response' that will contain the response of the ebay.com server. The 'response' variable equals the call of Requests' .get() method, and the .get() method takes the 'url' variable as an argument. Next, in the body of the main() function, I'm creating a new variable 'url' that equals the address of this detail page. I suggest we scrape the needed data first and then add the functionality to scrape all the links to the inner pages. So we have a detail page URL that we have to pass into the get_page() function; let's call get_page() and pass the 'url' variable into it: get_page(url).
Next: when we make a GET request, a server can respond to us differently. It can respond with a '404 Not Found' error if the requested resource does not exist, or it can forbid us access and respond with a 403 error, and so on. It is a good idea to check how the server responded, and to do that we can use the .ok property and the .status_code property of the Response class; our 'response' variable is an instance of exactly that class. For instance, print(response.ok) gives us True, which means that the eBay server responded successfully. We can also use the .status_code property: we get 200, and 200 means that the server responded successfully. Awesome.
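The two checks from above look like this (assuming 'response' holds the result of a requests.get() call):

    print(response.ok)           # True for any successful (2xx) status
    print(response.status_code)  # e.g. 200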
Now I'm creating a new IF-ELSE condition: 'if not response.ok', that is, any status code except 200. If it is not OK I want to print a message, and if the server responds successfully, 'pass' for now. Next: we have the server's response, and now we have to get the HTML code of the page whose URL we passed into the .get() method; at the same time, to perform searches efficiently, we need to convert the HTML code into a tree of Python objects. To do that we will use the Beautiful Soup library, so I have to import the BeautifulSoup class. Inside the ELSE block I am creating a new variable 'soup'; 'soup' is an instance of the BeautifulSoup class. The constructor of the BeautifulSoup class takes at least two arguments: the first argument is the HTML code of a page, that is, 'response' and its .text property. The second argument is a parser that will parse the HTML code, and the most efficient and fastest parser is the lxml parser; Beautiful Soup uses it under the hood, but we have to specify it as the second argument. Finally, the get_page() function will return the BeautifulSoup instance, that is, our 'soup' variable.
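Put together, a rough sketch of get_page() as described so far (note that eBay may block requests that lack a browser-like User-Agent header; the tutorial does not add one):

    import requests
    from bs4 import BeautifulSoup

    def get_page(url):
        response = requests.get(url)
        if not response.ok:
            print('Server responded:', response.status_code)
        else:
            # response.text holds the raw HTML; 'lxml' is the parser
            soup = BeautifulSoup(response.text, 'lxml')
            return soup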
The next step is to scrape the needed data. I am creating a new function, get_detail_data(), that takes the 'soup' object as an argument. Now let's perform some analysis of the page. I will take a title, a price, and the items sold; let's start with the title. My purpose for now is to identify these elements on the current page. I need the title, so I right-click on it and choose 'Inspect' or 'Inspect element'. The inspector opens up and we can see that the title is an H1 HTML tag with the 'itemTitle' CSS id. So I want to try to find the H1 tag by this particular CSS property, I mean its 'id'. Here I am creating an 'h1' variable that equals the 'soup' object we got as an argument, and I use its .find_all() method. I pass into the .find_all() method the name of the tag, h1, a comma, and then its identifying property, that is, the 'id' attribute; we can just copy and paste it here. Let's print it out... I'm sorry, I forgot to call the get_detail_data() function. So here in the main() function I'm calling the get_detail_data() function, and as an argument I am passing into it the call of get_page() with the 'url' variable as an argument. Save and run, and we can see that there is only one H1 tag on the page, so here I can use just the .find() method, not .find_all(). What's the catch with the .find() method? It returns only the first result, and sometimes that gives unexpected results, so I just wanted to check whether the page contains more than one H1 tag; anyway, we can always correct our code. Let's print it out again, and now we can see that the English title of the product is inside the 'data-mtdes' attribute of the 'a' tag, not in the text of the H1 tag. So I have to correct my search query, and here I'm chaining the next .find() call. I want to draw your attention to the fact that this expression returns a BeautifulSoup object, so we can keep using BeautifulSoup's methods and properties on it. I'm calling the next .find() method, looking for the 'a' tag; save and run again. This time we got the 'a' tag and we can see that the 'data-mtdes' attribute contains the needed title. So I need to get the value of the 'data-mtdes' attribute, but how? This expression returns the 'a' tag, which is also a BeautifulSoup object, and to get the value of the 'data-mtdes' attribute I am calling the dictionary-style .get() method, passing into it the name of the attribute I want, that is, 'data-mtdes'. Save and run again, and now we get the English title of the product.
But sometimes websites use a different layout for what looks at first glance like the same object: for instance, if a product has a discount, then its layout (a title, a price, or something else) can have different CSS classes and different IDs, and Beautiful Soup cannot find the object with our search queries. Let's assume the layout for another product is different; say, I add something here and run the script. I get an exception and our script is terminated with a NoneType error. To handle such exceptions Python provides TRY-EXCEPT blocks, so I want to wrap this expression in a TRY-EXCEPT block. Our get_detail_data() function will try to find the needed data with this search query, and if it fails I want 'h1' to be an empty string. Let's also rename it to 'title'.
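The title extraction wrapped in TRY-EXCEPT might look like this; the 'itemTitle' id and the 'data-mtdes' attribute reflect eBay's markup at the time of recording and may differ on today's pages:

    try:
        title = soup.find('h1', id='itemTitle').find('a').get('data-mtdes')
    except:
        title = ''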
Let's go further: now I need the price, so I am starting a new TRY-EXCEPT block and repeating my actions. The price, I think, is here; right-click, 'Inspect', and we can see that the price is a 'span' tag with the 'prcIsum' CSS id. Let's find it: I am copying its 'id', and here the 'price' variable equals soup.find(), looking for a 'span' tag with an id equal to 'prcIsum'. In the EXCEPT block, 'price' equals an empty string. I want to print out the 'price'; we got the 'span' tag, and now we can see that the price is the text of the 'span' tag. To get the text I have to use the .text property of the soup object. Let's run it again; we got the string. Strings sometimes have spaces at the beginning and at the end, so I want to use the .strip() method to get rid of them. I also want to split the string by the space, which I can do with the string's .split() method, specifying the space as a separator. Now I have a list with two elements: the first element is the currency and the second element is the amount. I want to unpack it: let's say the stripped string is the 'p' variable, and here I'm creating two new ones, the 'currency' variable and the 'price' variable, unpacked from 'p' split by space. In the EXCEPT block I also set 'currency' to an empty string. Let's run it again, and we can see the price and the currency. OK, great.
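A sketch of the price block as described, assuming the price text splits into exactly two parts, e.g. 'GBP 25.99':

    try:
        p = soup.find('span', id='prcIsum').text.strip()
        # e.g. 'GBP 25.99' -> currency='GBP', price='25.99'
        currency, price = p.split(' ')
    except:
        currency = ''
        price = ''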
Next I want to scrape the total items sold, so let's inspect again; I think there is a different layout here. We can see that the total items sold is the text of an 'a' tag, and the 'a' tag is a child of a 'span' container with the 'soldwithfeedback' CSS class, but on the other layout there is no such 'span'. So I need to specify a class or id or some other identifying property that finds the needed result. These classes are different, but this class, 'vi-qtyS-hot-red', is the same, so I suggest we try this class to find the 'span', then find the 'a' tag, and get the text of the 'a' tag. Let's try it. 'class' is a reserved keyword in Python, so we have to use an underscore here: 'class_'. Print it out and run again; we can see our 'span' tag. Next I have to find the 'a' tag, so I am adding a call to the .find() method and passing 'a' into it as an argument. And we got the quantity of sold items. This word means 'sold', like here: ebay.com automatically translates all text into your language, and it determines the language by your IP address, so the literal word 'sold' is not there, but it doesn't really matter. Next I need to split this string by space again, and I want only the first element, so I'm specifying index zero, and again I get the total quantity of sold items. Great!
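The items-sold block might then look like this; again, the 'vi-qtyS-hot-red' class is specific to the page layout at the time of recording:

    try:
        total_sold = soup.find('span', class_='vi-qtyS-hot-red').find('a').text.split(' ')[0]
    except:
        total_sold = ''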
Now I want to pack all the scraped data into a dictionary. I'm creating a new variable 'data' that equals a dictionary with the key 'title'; the value of the 'title' key will be the 'title' variable, and so on. And our function will return the 'data' dictionary.
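So the end of get_detail_data() might look like this, assembling the three values:

    data = {
        'title': title,
        'price': price,
        'currency': currency,
        'total sold': total_sold,
    }
    return data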
The next step is to scrape all the links to the inner pages, the detail pages of each product. Let's look at the index page, but first at its URL address. We can see that the URL consists of a main part and other smaller parts separated by ampersand signs; these parts are the parameters of a GET request to the server, and by manipulating these parameters we can get the needed output. For now I want to simplify the URL, so I am deleting all the parameters except our search query, 'men's watches'. Let's try it... and it works: we get the same result. Great! Now I'm scrolling the page down to the pagination block, and when I hover the mouse over the page numbers, in the bottom-left corner we can see that the URL has gained one more part, a page number: 'pgn=6'... 5... 4... et cetera. So to get the other index pages we just change '&_pgn=2' (and we see the second page), change it to 3, and so on: by changing the page number we can get the other index pages. I'm copying the new URL and pasting it here.
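As a hypothetical sketch, the simplified search URL with the _pgn parameter could be iterated like this (the exact base URL is an assumption reconstructed from the parameters shown in the video):

    base_url = 'https://www.ebay.com/sch/i.html?_nkw=mens+watches&_pgn={}'
    for page_number in range(1, 4):
        print(base_url.format(page_number))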
For now I want to scrape the links to the detail pages of each product only on the current page; the other pages can be scraped with a FOR loop just by changing the page number. To get all the links, let's create a new function, say get_index_data(), that takes a soup object as an argument, with a TRY-EXCEPT block inside. Now I go to the index page, right-click, and Inspect. We can see that we need the 'a' tags with the class 's-item__link'. I copy the class of the 'a' tag, and now I have to find all the 'a' tags on the page with this class. So here the 'links' variable equals the soup object's .find_all() call, which scrapes all 'a' tags with that class; I just paste the CSS class, 's-item__link'. In the EXCEPT block, 'links' equals an empty list. The .find_all() method returns a list; let's print it out. And here I have to call the get_index_data() function. I got the list, and it is a list of 'a' tag objects, but I don't need the entire 'a' tag with all its attributes, I need only the 'href' attribute. So I am creating a new variable, say 'urls', that equals an empty list, and I want to use a list comprehension to fill it with the URL addresses of the inner pages. To do that I call the .get() method on each element of the 'links' list to get the 'href' attribute: item.get('href') for each 'item' in the 'links' list. Let's print it out. And we got just the links (URLs).
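Assembled, get_index_data() might look like this:

    def get_index_data(soup):
        try:
            links = soup.find_all('a', class_='s-item__link')
        except:
            links = []
        # keep only the href attribute of every <a> tag
        urls = [item.get('href') for item in links]
        return urls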
Our next step is to iterate over this list of URLs with a FOR loop and on each iteration call the get_detail_data() function for each URL, each element of the list. So in the main() function I need a list of URL addresses, one for each product. I create a new variable, say 'products', that equals the call of the get_index_data() function. We got the list of links, and now I'm starting a FOR loop. Inside it I'm creating a new variable 'data', and then I am calling the get_detail_data() function, passing into it the get_page() function with the loop variable; let's rename it to 'link'. For each link we will get such a dictionary; let's print it out and run the script. I got an error: "NoneType object is not iterable". I think the problem is that I forgot to return the 'urls' variable from get_index_data(). Let's run it again. OK, we got our dictionaries, with a price, a currency, a total sold value... Great!
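The hub function might now look roughly like this (the index URL is an assumption, as above):

    def main():
        url = 'https://www.ebay.com/sch/i.html?_nkw=mens+watches&_pgn=1'
        products = get_index_data(get_page(url))
        for link in products:
            data = get_detail_data(get_page(link))
            print(data)  # later replaced by writing to CSV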
But in the 'total sold' value we can see some strange symbols: that is a non-breaking space. We can get rid of it by calling the .replace() string method; the first argument is WHAT I want to replace, '\xa0' (the non-breaking space), and the second is what I want INSTEAD of it, say, an empty string. OK.
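For example:

    raw = '1\xa0000 sold'  # the word itself may be localized
    total_sold = raw.split(' ')[0].replace('\xa0', '')
    print(total_sold)  # -> 1000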
The next step: I want to save these dictionaries to a CSV file, so I create a new function, write_csv(), that gets the 'data' dictionary as an argument. As we want to work with CSV, we need to import the csv module. Inside the write_csv() function I'm using the WITH context manager to deal with a file: I am calling the open() function and passing a file name into it, say 'output.csv'. If the open() function finds a file with this name it will open it; otherwise it will create a new file named 'output.csv'. I also specify a second argument, the letter 'a'. 'a' means 'append', so new data will be appended to the end of the 'output.csv' file. I save this file object, opened for writing, to the variable 'csvfile' with the 'as' keyword. Next, I'm creating a new variable 'writer' that equals the call of the csv module's .writer() method, which takes our 'csvfile' object as an argument. Finally, I want to write the contents of the 'data' dictionary to the CSV file, so I'm calling the writer's .writerow() method. The .writerow() method takes only one argument, so I am creating a new variable, say 'row', that equals a list, and here I am enumerating the keys of the 'data' dictionary in the order I want to have in the CSV file; we have the 'title', 'price', 'currency', and 'total sold' keys. Then I pass the 'row' variable into the .writerow() method. I also think it will be useful to pass the URL address into the write_csv() function, so I am adding the call of write_csv() here, with 'data' as the first argument and the 'link' variable as the second. Here I'm specifying the second parameter, 'link' or 'url'; let's say it will be 'url'. And at the end of the 'row' list I'm adding the 'url'.
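A sketch of write_csv() as described; the newline='' argument is an addition not mentioned in the video, but it avoids blank rows appearing on Windows:

    import csv

    def write_csv(data, url):
        # 'a' appends, so every product becomes one more row in output.csv
        with open('output.csv', 'a', newline='') as csvfile:
            writer = csv.writer(csvfile)
            row = [data['title'], data['price'], data['currency'],
                   data['total sold'], url]
            writer.writerow(row)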
So let's run the script. We got 'output.csv'; let's look at it in LibreOffice or Microsoft Office. OK, there is our data, but we can see some blank cells. We can copy the URL of such a row and check it with the inspector again: there is a discount, and the ID is different. So how can we solve this problem? We can create a nested TRY-EXCEPT block here, where we try to find the 'span' tag with the other ID, and move the 'currency' definition out to the same level as the main TRY-EXCEPT block.
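A sketch of that fallback; 'DISCOUNT_PRICE_ID' is a placeholder, since the video does not name the discounted-price id, and the real one has to be read from the inspector:

    try:
        p = soup.find('span', id='prcIsum').text.strip()
    except:
        try:
            # hypothetical id -- read the real one from the inspector
            p = soup.find('span', id='DISCOUNT_PRICE_ID').text.strip()
        except:
            p = ''

    # the unpacking now lives at the same level as the outer TRY-EXCEPT
    if p:
        currency, price = p.split(' ')
    else:
        currency, price = '', ''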
Let's delete the test change and run the script again. Done; let's open the file with LibreOffice. Now we can see that the empty cells are filled, but our data is not normalized: I would like to get rid of the dollar sign, but maybe next time. We also still have a blank title; sometimes the title has different CSS attributes, but it doesn't matter here. Let's sort these columns by 'total sold', which is column D: Data menu, Sort, column D, descending order. OK. We can see that the most in-demand and most successful product is this watch at almost two bucks; here, for instance, is one with 1,000 items sold at a price of almost $85, and this one has 22,000 sold at a price of about a dollar and a half. Oh my god... Nevertheless, at first glance this is the most in-demand product. If you liked the video, please click 'like' and subscribe. Thank you.

15 Comments

  • U2FsdGVkX1x

    Followed your tutorial and something went wrong: when I run my script and print(h1), I get None. I do get a response.text, but it looks like JSON rather than HTML.

  • dendi sega

    Could you make the same kind of video, but for Avito? And you didn't show how to change pages automatically from 1 to 2 and so on.

  • Peter Napoli

    Up to 14:06 in the tutorial, the line of code:

        h1 = soup.find('h1', id='itemTitle').find('a').get('data-mtdes')

    throws an error:

        File "ebay.py", line 29, in <module>
          main()
        File "ebay.py", line 26, in main
          get_detail_data(get_page(url))
        File "ebay.py", line 19, in get_detail_data
          h1 = soup.find('h1', id='itemTitle').find('a').get('data-mtdes')
        AttributeError: 'NoneType' object has no attribute 'get'

  • kumar Mannu

    Awesome tutorial, I really enjoyed it and learned a lot!! Thank you so much.

    The only thing I want to ask: my output for the title looks like "Leather Band Round Quartz Analog Elegant Classic Casual Men's Wrist Watch New | eBay". How do I remove this '| eBay'? I checked but couldn't figure it out.

  • Luka Kujundžić-Lujan

    I love the Russian accent. I can't listen to tutorials by Indian speakers; I have nothing against them, I just can't listen to them. But Russian, I love that language. I know a little Russian: ya ponimayu russky nemnogo 😀 Croatian is also very similar to Russian. Cheers.

  • Carlos Matos Fanpage

    Hello.
    I want to build a website that compares the price of a product across different websites. How do I display the data which I have scraped on my website?
