CMSC320 FINAL PROJECT

Metal Bands Around The World

Name: Yan-Jen Lo      UID: 114204813      Due Date: 12/16/17 @ 1:30 PM

INTERACTIVE MENU: \m/ ( -_- ) \m/

Introduction


Inspiration: Kaggle's Metal by Nation and Genre Data

While looking around for a data set to use, I stumbled upon this kaggle entry on metal bands. Being a metal head since middle school, this peaked my interest like no other data set I've seen so far. After downloading the two data sets from here:

  1. metal_bands_2017.csv
  2. world_population_1960_2015.csv
The dataset had only 5000 metal bands and the world population data didn't go up to 2017. Also if you check out this kaggle notebook, someone has already analyzed the data. I wanted to see for myself if the analysis for 5000 bands still holds for this data which is more than 20x larger.

Data Sources: Encyclopedia Metallum & Wikipedia

Data Link Stats Date Accessed
Encyclopedia Metallum Metal-Archives 119474 Bands 12/5/17
Wikipedia Population Data 240 Countries 12/15/17

When I use to look up bands, I used the Encyclopedia Metallum and the database had just about every obscure band you could find. Since they let users contribute to the databse (the acceptance process is heavily monitored), small bands can put themselves into the databse. Instead of using the kaggle data, I decided to scrape the metal archives myself since I felt that 5000 bands was lacking in terms of analysis for the entire world. For population data, I opted for Wikipedia since most of their population information was based on a real time world population clock.

Goal: Analysis With Extensive Visualization (Graphs & Interactive World Map)


QUESTIONS:
  1. Which genre type seems to dominate the metal playing field?
  2.  
  3. Which country is the most metal?

Scrap All Countries


In [2]:
import time
from multiprocessing  import Pool
import requests
from bs4 import BeautifulSoup
import html5lib
import json

import pandas as pd
import numpy as np

import csv
import re

HTTP Request

To scrap the countries page of Encyclopedia Metallum, we need to make a HTTP request to the URL. If the request was successful and a connection was created, the response should be:

<Response [200]>
In [18]:
enyclopaedia_metallum_country_url = 'https://www.metal-archives.com/browse/country'
countries_request = requests.get(enyclopaedia_metallum_country_url)
print(countries_request)
<Response [200]>

BeautifulSoup

To convert the request response into usable HTML tags easily, BeautifulSoup is the way to go. Make sure you have these imported:

from bs4 import BeautifulSoup
import html5lib

The html5lib library helps convert the response to HTML, thus allowing the use of .find_all(tag). Your goal is to look at the HTML and figure which tags have the information you want and specifically target that tag to filter out as much as possible.

  • In this case, the columns of countries starts with the tag:

<div class="countryCol">
Common .find_all Hacks

1) href or class field tag

<a href="https://www.metal-archives.com/lists/AF">Afghanistan"</a>
<div class="countryCol">
.find_all('a', href = True)
.find_all('div', class_ = True)

2) Specify field specifics with dictionaries

<div class="clear loading">
.find_all('a', {'class': 'clear loading'})
In [16]:
soup = BeautifulSoup(countries_request.content, 'html5lib')
countries_soup = soup.find_all('div',{'class': 'countryCol'})
countries_soup
Out[16]:
[<div class="countryCol">
 											<a href="https://www.metal-archives.com/lists/AF">Afghanistan</a><br/>
 											<a href="https://www.metal-archives.com/lists/AX">Åland Islands</a><br/>
 											<a href="https://www.metal-archives.com/lists/AL">Albania</a><br/>
 											<a href="https://www.metal-archives.com/lists/DZ">Algeria</a><br/>
 											<a href="https://www.metal-archives.com/lists/AD">Andorra</a><br/>
 											<a href="https://www.metal-archives.com/lists/AO">Angola</a><br/>
 											<a href="https://www.metal-archives.com/lists/AR">Argentina</a><br/>
 											<a href="https://www.metal-archives.com/lists/AM">Armenia</a><br/>
 											<a href="https://www.metal-archives.com/lists/AW">Aruba</a><br/>
 											<a href="https://www.metal-archives.com/lists/AU">Australia</a><br/>
 											<a href="https://www.metal-archives.com/lists/AT">Austria</a><br/>
 											<a href="https://www.metal-archives.com/lists/AZ">Azerbaijan</a><br/>
 											<a href="https://www.metal-archives.com/lists/BH">Bahrain</a><br/>
 											<a href="https://www.metal-archives.com/lists/BD">Bangladesh</a><br/>
 											<a href="https://www.metal-archives.com/lists/BB">Barbados</a><br/>
 											<a href="https://www.metal-archives.com/lists/BY">Belarus</a><br/>
 											<a href="https://www.metal-archives.com/lists/BE">Belgium</a><br/>
 											<a href="https://www.metal-archives.com/lists/BZ">Belize</a><br/>
 											<a href="https://www.metal-archives.com/lists/BO">Bolivia</a><br/>
 											<a href="https://www.metal-archives.com/lists/BA">Bosnia and Herzegovina</a><br/>
 											<a href="https://www.metal-archives.com/lists/BW">Botswana</a><br/>
 											<a href="https://www.metal-archives.com/lists/BR">Brazil</a><br/>
 											<a href="https://www.metal-archives.com/lists/BN">Brunei</a><br/>
 											<a href="https://www.metal-archives.com/lists/BG">Bulgaria</a><br/>
 											<a href="https://www.metal-archives.com/lists/KH">Cambodia</a><br/>
 											<a href="https://www.metal-archives.com/lists/CA">Canada</a><br/>
 											<a href="https://www.metal-archives.com/lists/CL">Chile</a><br/>
 											<a href="https://www.metal-archives.com/lists/CN">China</a><br/>
 											<a href="https://www.metal-archives.com/lists/CO">Colombia</a><br/>
 											<a href="https://www.metal-archives.com/lists/CR">Costa Rica</a><br/>
 											<a href="https://www.metal-archives.com/lists/HR">Croatia</a><br/>
 											<a href="https://www.metal-archives.com/lists/CU">Cuba</a><br/>
 											<a href="https://www.metal-archives.com/lists/CW">Curaçao</a><br/>
 											<a href="https://www.metal-archives.com/lists/CY">Cyprus</a><br/>
 											<a href="https://www.metal-archives.com/lists/CZ">Czech Republic</a><br/>
 											<a href="https://www.metal-archives.com/lists/DK">Denmark</a><br/>
 											<a href="https://www.metal-archives.com/lists/DO">Dominican Republic</a><br/>
 											<a href="https://www.metal-archives.com/lists/EC">Ecuador</a><br/>
 											<a href="https://www.metal-archives.com/lists/EG">Egypt</a><br/>
 											<a href="https://www.metal-archives.com/lists/SV">El Salvador</a><br/>
 											<a href="https://www.metal-archives.com/lists/EE">Estonia</a><br/>
 											<a href="https://www.metal-archives.com/lists/ET">Ethiopia</a><br/>
 											<a href="https://www.metal-archives.com/lists/FO">Faroe Islands</a><br/>
 											<a href="https://www.metal-archives.com/lists/FI">Finland</a><br/>
 											<a href="https://www.metal-archives.com/lists/FR">France</a><br/>
 											<a href="https://www.metal-archives.com/lists/PF">French Polynesia</a><br/>
 											<a href="https://www.metal-archives.com/lists/GE">Georgia</a><br/>
 											<a href="https://www.metal-archives.com/lists/DE">Germany</a><br/>
 											<a href="https://www.metal-archives.com/lists/GI">Gibraltar</a><br/>
 											<a href="https://www.metal-archives.com/lists/GR">Greece</a><br/>
 									</div>, <div class="countryCol">
 											<a href="https://www.metal-archives.com/lists/GL">Greenland</a><br/>
 											<a href="https://www.metal-archives.com/lists/GU">Guam</a><br/>
 											<a href="https://www.metal-archives.com/lists/GT">Guatemala</a><br/>
 											<a href="https://www.metal-archives.com/lists/GG">Guernsey</a><br/>
 											<a href="https://www.metal-archives.com/lists/GY">Guyana</a><br/>
 											<a href="https://www.metal-archives.com/lists/HN">Honduras</a><br/>
 											<a href="https://www.metal-archives.com/lists/HU">Hungary</a><br/>
 											<a href="https://www.metal-archives.com/lists/IS">Iceland</a><br/>
 											<a href="https://www.metal-archives.com/lists/IN">India</a><br/>
 											<a href="https://www.metal-archives.com/lists/ID">Indonesia</a><br/>
 											<a href="https://www.metal-archives.com/lists/XX">International</a><br/>
 											<a href="https://www.metal-archives.com/lists/IR">Iran</a><br/>
 											<a href="https://www.metal-archives.com/lists/IQ">Iraq</a><br/>
 											<a href="https://www.metal-archives.com/lists/IE">Ireland</a><br/>
 											<a href="https://www.metal-archives.com/lists/IM">Isle of Man</a><br/>
 											<a href="https://www.metal-archives.com/lists/IL">Israel</a><br/>
 											<a href="https://www.metal-archives.com/lists/IT">Italy</a><br/>
 											<a href="https://www.metal-archives.com/lists/JM">Jamaica</a><br/>
 											<a href="https://www.metal-archives.com/lists/JP">Japan</a><br/>
 											<a href="https://www.metal-archives.com/lists/JE">Jersey</a><br/>
 											<a href="https://www.metal-archives.com/lists/JO">Jordan</a><br/>
 											<a href="https://www.metal-archives.com/lists/KZ">Kazakhstan</a><br/>
 											<a href="https://www.metal-archives.com/lists/KE">Kenya</a><br/>
 											<a href="https://www.metal-archives.com/lists/KR">Korea, South</a><br/>
 											<a href="https://www.metal-archives.com/lists/KW">Kuwait</a><br/>
 											<a href="https://www.metal-archives.com/lists/KG">Kyrgyzstan</a><br/>
 											<a href="https://www.metal-archives.com/lists/LA">Laos</a><br/>
 											<a href="https://www.metal-archives.com/lists/LV">Latvia</a><br/>
 											<a href="https://www.metal-archives.com/lists/LB">Lebanon</a><br/>
 											<a href="https://www.metal-archives.com/lists/LY">Libya</a><br/>
 											<a href="https://www.metal-archives.com/lists/LI">Liechtenstein</a><br/>
 											<a href="https://www.metal-archives.com/lists/LT">Lithuania</a><br/>
 											<a href="https://www.metal-archives.com/lists/LU">Luxembourg</a><br/>
 											<a href="https://www.metal-archives.com/lists/MK">Macedonia (FYROM)</a><br/>
 											<a href="https://www.metal-archives.com/lists/MG">Madagascar</a><br/>
 											<a href="https://www.metal-archives.com/lists/MY">Malaysia</a><br/>
 											<a href="https://www.metal-archives.com/lists/MV">Maldives</a><br/>
 											<a href="https://www.metal-archives.com/lists/MT">Malta</a><br/>
 											<a href="https://www.metal-archives.com/lists/MU">Mauritius</a><br/>
 											<a href="https://www.metal-archives.com/lists/MX">Mexico</a><br/>
 											<a href="https://www.metal-archives.com/lists/MD">Moldova</a><br/>
 											<a href="https://www.metal-archives.com/lists/MC">Monaco</a><br/>
 											<a href="https://www.metal-archives.com/lists/MN">Mongolia</a><br/>
 											<a href="https://www.metal-archives.com/lists/ME">Montenegro</a><br/>
 											<a href="https://www.metal-archives.com/lists/MA">Morocco</a><br/>
 											<a href="https://www.metal-archives.com/lists/MZ">Mozambique</a><br/>
 											<a href="https://www.metal-archives.com/lists/MM">Myanmar</a><br/>
 											<a href="https://www.metal-archives.com/lists/NA">Namibia</a><br/>
 											<a href="https://www.metal-archives.com/lists/NP">Nepal</a><br/>
 											<a href="https://www.metal-archives.com/lists/NL">Netherlands</a><br/>
 									</div>, <div class="countryCol">
 											<a href="https://www.metal-archives.com/lists/NC">New Caledonia</a><br/>
 											<a href="https://www.metal-archives.com/lists/NZ">New Zealand</a><br/>
 											<a href="https://www.metal-archives.com/lists/NI">Nicaragua</a><br/>
 											<a href="https://www.metal-archives.com/lists/NO">Norway</a><br/>
 											<a href="https://www.metal-archives.com/lists/OM">Oman</a><br/>
 											<a href="https://www.metal-archives.com/lists/PK">Pakistan</a><br/>
 											<a href="https://www.metal-archives.com/lists/PA">Panama</a><br/>
 											<a href="https://www.metal-archives.com/lists/PY">Paraguay</a><br/>
 											<a href="https://www.metal-archives.com/lists/PE">Peru</a><br/>
 											<a href="https://www.metal-archives.com/lists/PH">Philippines</a><br/>
 											<a href="https://www.metal-archives.com/lists/PL">Poland</a><br/>
 											<a href="https://www.metal-archives.com/lists/PT">Portugal</a><br/>
 											<a href="https://www.metal-archives.com/lists/PR">Puerto Rico</a><br/>
 											<a href="https://www.metal-archives.com/lists/QA">Qatar</a><br/>
 											<a href="https://www.metal-archives.com/lists/RE">Reunion</a><br/>
 											<a href="https://www.metal-archives.com/lists/RO">Romania</a><br/>
 											<a href="https://www.metal-archives.com/lists/RU">Russia</a><br/>
 											<a href="https://www.metal-archives.com/lists/SM">San Marino</a><br/>
 											<a href="https://www.metal-archives.com/lists/SA">Saudi Arabia</a><br/>
 											<a href="https://www.metal-archives.com/lists/RS">Serbia</a><br/>
 											<a href="https://www.metal-archives.com/lists/SG">Singapore</a><br/>
 											<a href="https://www.metal-archives.com/lists/SK">Slovakia</a><br/>
 											<a href="https://www.metal-archives.com/lists/SI">Slovenia</a><br/>
 											<a href="https://www.metal-archives.com/lists/ZA">South Africa</a><br/>
 											<a href="https://www.metal-archives.com/lists/ES">Spain</a><br/>
 											<a href="https://www.metal-archives.com/lists/LK">Sri Lanka</a><br/>
 											<a href="https://www.metal-archives.com/lists/SR">Suriname</a><br/>
 											<a href="https://www.metal-archives.com/lists/SJ">Svalbard</a><br/>
 											<a href="https://www.metal-archives.com/lists/SE">Sweden</a><br/>
 											<a href="https://www.metal-archives.com/lists/CH">Switzerland</a><br/>
 											<a href="https://www.metal-archives.com/lists/SY">Syria</a><br/>
 											<a href="https://www.metal-archives.com/lists/TW">Taiwan</a><br/>
 											<a href="https://www.metal-archives.com/lists/TJ">Tajikistan</a><br/>
 											<a href="https://www.metal-archives.com/lists/TH">Thailand</a><br/>
 											<a href="https://www.metal-archives.com/lists/TT">Trinidad and Tobago</a><br/>
 											<a href="https://www.metal-archives.com/lists/TN">Tunisia</a><br/>
 											<a href="https://www.metal-archives.com/lists/TR">Turkey</a><br/>
 											<a href="https://www.metal-archives.com/lists/TM">Turkmenistan</a><br/>
 											<a href="https://www.metal-archives.com/lists/UG">Uganda</a><br/>
 											<a href="https://www.metal-archives.com/lists/UA">Ukraine</a><br/>
 											<a href="https://www.metal-archives.com/lists/AE">United Arab Emirates</a><br/>
 											<a href="https://www.metal-archives.com/lists/GB">United Kingdom</a><br/>
 											<a href="https://www.metal-archives.com/lists/US">United States</a><br/>
 											<a href="https://www.metal-archives.com/lists/ZZ">Unknown</a><br/>
 											<a href="https://www.metal-archives.com/lists/UY">Uruguay</a><br/>
 											<a href="https://www.metal-archives.com/lists/UZ">Uzbekistan</a><br/>
 											<a href="https://www.metal-archives.com/lists/VE">Venezuela</a><br/>
 											<a href="https://www.metal-archives.com/lists/VN">Vietnam</a><br/>
 									</div>]

As you can see from above, all the countries are in this tag format:

<a href="https://www.metal-archives.com/lists/(COUNTRY ID)">(COUNTRY)</a>

I simply extracted the countries out by looping through every country tag and using a regular expression's (regex) .groups() capabilities. Don't forget to import the regex library before use. Regexes are pretty useful, take a look here for official documentation.

import re
re.compile('<a href="https://www.metal-archives.com\/lists\/(.+)">(.+)</a>')

With the parenthesis around the parts where I want to extract information, .groups() will return a tuple in the form of (COUNTRY ID, COUNTRY).

In [19]:
# Create country dictionary where country is the key and the country id is the value
all_countries = {}
country_regex = re.compile('<a href="https://www.metal-archives.com\/lists\/(.+)">(.+)</a>')

for col in range(len(countries_soup)):
    country_a_href = countries_soup[col].find_all('a') # Further filtering tags to only country tags
    for a_href in range(len(country_a_href)):
        tag = str(country_a_href[a_href]).strip()
        matched = country_regex.match(tag).groups() # (ID, country)
        all_countries[matched[0]] = (matched[1])
        
all_countries
Out[19]:
{'AD': 'Andorra',
 'AE': 'United Arab Emirates',
 'AF': 'Afghanistan',
 'AL': 'Albania',
 'AM': 'Armenia',
 'AO': 'Angola',
 'AR': 'Argentina',
 'AT': 'Austria',
 'AU': 'Australia',
 'AW': 'Aruba',
 'AX': 'Åland Islands',
 'AZ': 'Azerbaijan',
 'BA': 'Bosnia and Herzegovina',
 'BB': 'Barbados',
 'BD': 'Bangladesh',
 'BE': 'Belgium',
 'BG': 'Bulgaria',
 'BH': 'Bahrain',
 'BN': 'Brunei',
 'BO': 'Bolivia',
 'BR': 'Brazil',
 'BW': 'Botswana',
 'BY': 'Belarus',
 'BZ': 'Belize',
 'CA': 'Canada',
 'CH': 'Switzerland',
 'CL': 'Chile',
 'CN': 'China',
 'CO': 'Colombia',
 'CR': 'Costa Rica',
 'CU': 'Cuba',
 'CW': 'Curaçao',
 'CY': 'Cyprus',
 'CZ': 'Czech Republic',
 'DE': 'Germany',
 'DK': 'Denmark',
 'DO': 'Dominican Republic',
 'DZ': 'Algeria',
 'EC': 'Ecuador',
 'EE': 'Estonia',
 'EG': 'Egypt',
 'ES': 'Spain',
 'ET': 'Ethiopia',
 'FI': 'Finland',
 'FO': 'Faroe Islands',
 'FR': 'France',
 'GB': 'United Kingdom',
 'GE': 'Georgia',
 'GG': 'Guernsey',
 'GI': 'Gibraltar',
 'GL': 'Greenland',
 'GR': 'Greece',
 'GT': 'Guatemala',
 'GU': 'Guam',
 'GY': 'Guyana',
 'HN': 'Honduras',
 'HR': 'Croatia',
 'HU': 'Hungary',
 'ID': 'Indonesia',
 'IE': 'Ireland',
 'IL': 'Israel',
 'IM': 'Isle of Man',
 'IN': 'India',
 'IQ': 'Iraq',
 'IR': 'Iran',
 'IS': 'Iceland',
 'IT': 'Italy',
 'JE': 'Jersey',
 'JM': 'Jamaica',
 'JO': 'Jordan',
 'JP': 'Japan',
 'KE': 'Kenya',
 'KG': 'Kyrgyzstan',
 'KH': 'Cambodia',
 'KR': 'Korea, South',
 'KW': 'Kuwait',
 'KZ': 'Kazakhstan',
 'LA': 'Laos',
 'LB': 'Lebanon',
 'LI': 'Liechtenstein',
 'LK': 'Sri Lanka',
 'LT': 'Lithuania',
 'LU': 'Luxembourg',
 'LV': 'Latvia',
 'LY': 'Libya',
 'MA': 'Morocco',
 'MC': 'Monaco',
 'MD': 'Moldova',
 'ME': 'Montenegro',
 'MG': 'Madagascar',
 'MK': 'Macedonia (FYROM)',
 'MM': 'Myanmar',
 'MN': 'Mongolia',
 'MT': 'Malta',
 'MU': 'Mauritius',
 'MV': 'Maldives',
 'MX': 'Mexico',
 'MY': 'Malaysia',
 'MZ': 'Mozambique',
 'NA': 'Namibia',
 'NC': 'New Caledonia',
 'NI': 'Nicaragua',
 'NL': 'Netherlands',
 'NO': 'Norway',
 'NP': 'Nepal',
 'NZ': 'New Zealand',
 'OM': 'Oman',
 'PA': 'Panama',
 'PE': 'Peru',
 'PF': 'French Polynesia',
 'PH': 'Philippines',
 'PK': 'Pakistan',
 'PL': 'Poland',
 'PR': 'Puerto Rico',
 'PT': 'Portugal',
 'PY': 'Paraguay',
 'QA': 'Qatar',
 'RE': 'Reunion',
 'RO': 'Romania',
 'RS': 'Serbia',
 'RU': 'Russia',
 'SA': 'Saudi Arabia',
 'SE': 'Sweden',
 'SG': 'Singapore',
 'SI': 'Slovenia',
 'SJ': 'Svalbard',
 'SK': 'Slovakia',
 'SM': 'San Marino',
 'SR': 'Suriname',
 'SV': 'El Salvador',
 'SY': 'Syria',
 'TH': 'Thailand',
 'TJ': 'Tajikistan',
 'TM': 'Turkmenistan',
 'TN': 'Tunisia',
 'TR': 'Turkey',
 'TT': 'Trinidad and Tobago',
 'TW': 'Taiwan',
 'UA': 'Ukraine',
 'UG': 'Uganda',
 'US': 'United States',
 'UY': 'Uruguay',
 'UZ': 'Uzbekistan',
 'VE': 'Venezuela',
 'VN': 'Vietnam',
 'XX': 'International',
 'ZA': 'South Africa',
 'ZZ': 'Unknown'}

Query All Countries With HTTP Request

This part is somewhat complex because of the way the Encyclopedia Metallum is made. When you click on a country, you will see all the bands for that specific country.

Here are the 3 main difficulties with scraping this part of the database:

  1. If you try to BeautifulSoup the HTML code, you won't see any of the listed bands because the webpage is a JavaScript rendered webpage. As you can see with the inspect element I captured through Internet Explorer (IE).
  2. If the country has more than 500 bands total, they are displayed by page number in batches of 500.
  3. If you try to scrap the same website a lot, most websites can detect it, and thus disconnect your HTTP connection.

Sources Used:

1) Source for deriving solutions to #1 and #2: jonchar on GitHub wrote his own scraper for the Encyclopedia Metallum.

2) Solution to number #3 was from this Stack Overflow post and links to HTTP headers are here.

Solution to Problem #1: AJAX Call With HTTP Requests

You can see the JavaScript command to the AJAX Databse (this is where all their data on bands is stored). To specify which country you want to see, you simply put in the country's ID number, which we have extracted earlier.

var grid = createGrid("#bandListCountry", 500, 'browse/ajax-country/c/(COUNTRY ID)/json/1/', {aoColumns: [ null, null, null, { bSortable: false, sWidth: '80px'} ] });

Instead of doing an HTTP Request like usual, we have to add the AJAX call to the end of the URL. The AJAX database doesn't return HTML, it gives you data in JSON form, so essentially Python dictionary form. Make sure you have the JSON library imported.

import json

Solution to Problem #2: Parameters & Payload With HTTP Requests

When a country has more than 500 bands, there will be more than one page, therefore we use the params part of HTTP requests and give it a payload. Look at jonchar's README.md for detail explanation of payload. Basically, to get the first page, you want:

payload = {'sEcho': 0, 'iDisplayStart': 0, 'iDisplayLength': 500}

For every consecutive page, you increase both start and end by 500. You want to do this until you've reached the last page.

Solution to Problem #3: Headers & User-Agents With HTTP Requests

Since we are querying one country (there's around 140 countries listed) at a time and each page at a time, we'll make maybe close to 200 requests to Encyclopedia Metallum. To prevent the website from disconnecting our HTTP request, we have to pretend to be browser accessing the page. The best way is to give the request a User-Agent along with the request. With the header below, I'm pretending to be Firefox:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}
In [11]:
# Helper function that returns all the bands listed for 
# a specific country and the specific page requested
def get_page_query(country, CID, start, end):
    
    # Full AJAX database call components
    URL = 'http://www.metal-archives.com'
    AJAX_REQUEST_BEGIN = '/browse/ajax-country/c/'
    AJAX_REQUEST_END = '/json/1/'
    
    payload = {'sEcho': 0,
                'iDisplayStart': start,
                'iDisplayLength': end} 
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}
    
    # Request with payload and headers
    r = requests.get(URL + AJAX_REQUEST_BEGIN + CID + AJAX_REQUEST_END, 
                     params = payload, 
                     headers = headers)
    
    # Converting response information into Python dictionary
    page_content = json.loads(r.text)
    total = page_content['iTotalRecords']
    
    r.close()
    
    return {'finished_query': total <= end, 'data': page_content['aaData'], 'total': total}

This helper function uses the above helper function. It extracts the information we want and writes it into the TSV passed in. This part mainly does the parsing where as the one above deals with the HTTP Request. Essentially a single call will write all the bands for the country requested into the designated TSV file.

In [10]:
# Helper function that uses the other helper function to get
# all the bands listed for a specific country
def get_bands_by_country(opened_file, writer, country,CID):

    data = []

    start = 0
    end = 0
    query_complete = False
    
    # Regexes
    name_ahref_tag = re.compile("<a href='(.*)'>(.*)</a>") # Name
    status_spanclass_tag = re.compile('<span class="(.*)">(.*)</span>') # Status
    website_regex = re.compile('https://www.metal-archives.com/bands/(.*)/(.*)') # Website and ID
    
    # Until the last page of the country is reached, keep querying
    while query_complete == False:

        start = end + 1
        end = end + 500
        query = get_page_query(country, CID, start, end)
        query_complete = query['finished_query']
        query_data = query['data'] # Contains Genres and Location

        band_info = []
        
        for idx in range(len(query_data)):
            
            band = query_data[idx]
            name_match = name_ahref_tag.match(band[0]).groups()
            status_match = status_spanclass_tag.match(band[3])

            name = name_match[1]
            website = name_match[0]
            
            ID = website_regex.match(website).groups()[1]
            genres = band[1]
            location = band[2]
            status = status_match.groups()[1]
            
            band_info = [name,ID,
                         country,CID,location,
                         genres,status,website]
            
            writer.writerow(band_info) # Write into passed in TSV file
            
            
    print(country + ' (' + CID + '): SUCCESS [' + str(query['total']) + ']')
    return data

Execute Query Code & Write To TSV

This part of the code executes all the helper functions above to write the information into a Tabs Separated File (TSV). Because the genres have commas in the string, I decided to use TSV to prevent problems with reading the file later on. To do this, you need to import the csv library:

import csv

To make this a TSV instead of CSV, just add this

delimiter = '\t'

field in your CSV writer. You can actually make a file with any separator you would with this feature.

If your computer is fast enough and you're also patient enough, you can watch the progress via the print statements. It shows the country and the number of bands queried.

In [47]:
# Creates a new TSV file and uses the two helper functions
# above to write the data to a file
with open('encyclopedia_metallum_data.tsv', 'w') as tsvfile:
   
    writer = csv.writer(tsvfile, delimiter = '\t')
    headings = ['band_name','band_id',
                'country', 'CID','location',
                'genres', 'status',
                'website']
    
    writer.writerow(headings)
    
    # Goes through every country retrieved earlier and calls helper function
    for CID,country in all_countries.items():
        get_bands_by_country(tsvfile,writer,country,CID)

tsvfile.close() 
Afghanistan (AF): SUCCESS [1]
Åland Islands (AX): SUCCESS [7]
Albania (AL): SUCCESS [11]
Algeria (DZ): SUCCESS [23]
Andorra (AD): SUCCESS [4]
Angola (AO): SUCCESS [4]
Argentina (AR): SUCCESS [2016]
Armenia (AM): SUCCESS [18]
Aruba (AW): SUCCESS [1]
Australia (AU): SUCCESS [2382]
Austria (AT): SUCCESS [1125]
Azerbaijan (AZ): SUCCESS [8]
Bahrain (BH): SUCCESS [12]
Bangladesh (BD): SUCCESS [65]
Barbados (BB): SUCCESS [2]
Belarus (BY): SUCCESS [375]
Belgium (BE): SUCCESS [1196]
Belize (BZ): SUCCESS [6]
Bolivia (BO): SUCCESS [245]
Bosnia and Herzegovina (BA): SUCCESS [126]
Botswana (BW): SUCCESS [9]
Brazil (BR): SUCCESS [5622]
Brunei (BN): SUCCESS [20]
Bulgaria (BG): SUCCESS [381]
Cambodia (KH): SUCCESS [2]
Canada (CA): SUCCESS [3752]
Chile (CL): SUCCESS [1999]
China (CN): SUCCESS [300]
Colombia (CO): SUCCESS [1432]
Costa Rica (CR): SUCCESS [217]
Croatia (HR): SUCCESS [405]
Cuba (CU): SUCCESS [78]
Curaçao (CW): SUCCESS [1]
Cyprus (CY): SUCCESS [71]
Czech Republic (CZ): SUCCESS [1376]
Denmark (DK): SUCCESS [924]
Dominican Republic (DO): SUCCESS [31]
Ecuador (EC): SUCCESS [351]
Egypt (EG): SUCCESS [35]
El Salvador (SV): SUCCESS [140]
Estonia (EE): SUCCESS [195]
Ethiopia (ET): SUCCESS [1]
Faroe Islands (FO): SUCCESS [13]
Finland (FI): SUCCESS [3657]
France (FR): SUCCESS [4813]
French Polynesia (PF): SUCCESS [1]
Georgia (GE): SUCCESS [22]
Germany (DE): SUCCESS [10402]
Gibraltar (GI): SUCCESS [5]
Greece (GR): SUCCESS [1899]
Greenland (GL): SUCCESS [5]
Guam (GU): SUCCESS [5]
Guatemala (GT): SUCCESS [136]
Guernsey (GG): SUCCESS [13]
Guyana (GY): SUCCESS [3]
Honduras (HN): SUCCESS [46]
Hungary (HU): SUCCESS [1086]
Iceland (IS): SUCCESS [119]
India (IN): SUCCESS [194]
Indonesia (ID): SUCCESS [1623]
International (XX): SUCCESS [657]
Iran (IR): SUCCESS [78]
Iraq (IQ): SUCCESS [10]
Ireland (IE): SUCCESS [385]
Isle of Man (IM): SUCCESS [2]
Israel (IL): SUCCESS [259]
Italy (IT): SUCCESS [6179]
Jamaica (JM): SUCCESS [1]
Japan (JP): SUCCESS [1843]
Jersey (JE): SUCCESS [5]
Jordan (JO): SUCCESS [30]
Kazakhstan (KZ): SUCCESS [50]
Kenya (KE): SUCCESS [4]
Korea, South (KR): SUCCESS [235]
Kuwait (KW): SUCCESS [6]
Kyrgyzstan (KG): SUCCESS [7]
Laos (LA): SUCCESS [6]
Latvia (LV): SUCCESS [106]
Lebanon (LB): SUCCESS [47]
Libya (LY): SUCCESS [4]
Liechtenstein (LI): SUCCESS [12]
Lithuania (LT): SUCCESS [171]
Luxembourg (LU): SUCCESS [82]
Macedonia (FYROM) (MK): SUCCESS [101]
Madagascar (MG): SUCCESS [10]
Malaysia (MY): SUCCESS [591]
Maldives (MV): SUCCESS [8]
Malta (MT): SUCCESS [64]
Mauritius (MU): SUCCESS [2]
Mexico (MX): SUCCESS [2532]
Moldova (MD): SUCCESS [32]
Monaco (MC): SUCCESS [6]
Mongolia (MN): SUCCESS [6]
Montenegro (ME): SUCCESS [11]
Morocco (MA): SUCCESS [22]
Mozambique (MZ): SUCCESS [2]
Myanmar (MM): SUCCESS [6]
Namibia (NA): SUCCESS [2]
Nepal (NP): SUCCESS [24]
Netherlands (NL): SUCCESS [2264]
New Caledonia (NC): SUCCESS [6]
New Zealand (NZ): SUCCESS [364]
Nicaragua (NI): SUCCESS [26]
Norway (NO): SUCCESS [1622]
Oman (OM): SUCCESS [2]
Pakistan (PK): SUCCESS [49]
Panama (PA): SUCCESS [76]
Paraguay (PY): SUCCESS [154]
Peru (PE): SUCCESS [473]
Philippines (PH): SUCCESS [248]
Poland (PL): SUCCESS [3273]
Portugal (PT): SUCCESS [1274]
Puerto Rico (PR): SUCCESS [150]
Qatar (QA): SUCCESS [2]
Reunion (RE): SUCCESS [6]
Romania (RO): SUCCESS [378]
Russia (RU): SUCCESS [3647]
San Marino (SM): SUCCESS [2]
Saudi Arabia (SA): SUCCESS [12]
Serbia (RS): SUCCESS [393]
Singapore (SG): SUCCESS [253]
Slovakia (SK): SUCCESS [553]
Slovenia (SI): SUCCESS [269]
South Africa (ZA): SUCCESS [201]
Spain (ES): SUCCESS [3400]
Sri Lanka (LK): SUCCESS [37]
Suriname (SR): SUCCESS [2]
Svalbard (SJ): SUCCESS [3]
Sweden (SE): SUCCESS [4388]
Switzerland (CH): SUCCESS [989]
Syria (SY): SUCCESS [31]
Taiwan (TW): SUCCESS [69]
Tajikistan (TJ): SUCCESS [5]
Thailand (TH): SUCCESS [173]
Trinidad and Tobago (TT): SUCCESS [7]
Tunisia (TN): SUCCESS [23]
Turkey (TR): SUCCESS [498]
Turkmenistan (TM): SUCCESS [3]
Uganda (UG): SUCCESS [2]
Ukraine (UA): SUCCESS [912]
United Arab Emirates (AE): SUCCESS [26]
United Kingdom (GB): SUCCESS [4772]
United States (US): SUCCESS [25120]
Unknown (ZZ): SUCCESS [71]
Uruguay (UY): SUCCESS [194]
Uzbekistan (UZ): SUCCESS [7]
Venezuela (VE): SUCCESS [397]
Vietnam (VN): SUCCESS [39]

Load In Our TSV

After all that hard work, let's reap the reward. Using Pandas, we can easily read in a CSV,TSV, or any value separated file as a DataFrame. Make sure you've imported the library:

import pandas as pd
In [121]:
with open('encyclopedia_metallum_data.tsv', 'r') as MA_tsv:
    metal_data = pd.read_csv(MA_tsv, sep = '\t')
    
metal_data[['CID']] = metal_data[['CID']].replace(np.NAN, 'NA')
metal_data
Out[121]:
band_name band_id country CID location genres status website
0 District Unknown 3540379792 Afghanistan AF Kabul Progressive Groove Metal Active https://www.metal-archives.com/bands/District_...
1 Contamination of Expressions 118214 Åland Islands AX Mariehamn Doom/Heavy Metal Split-up https://www.metal-archives.com/bands/Contamina...
2 Eternity 3540369455 Åland Islands AX NaN Heavy Metal Split-up https://www.metal-archives.com/bands/Eternity/...
3 Haudankaivaja 127805 Åland Islands AX Ödkarby Death Metal Changed name https://www.metal-archives.com/bands/Haudankai...
4 Hiisi 113902 Åland Islands AX NaN Folk/Black Metal Split-up https://www.metal-archives.com/bands/Hiisi/113902
5 Lars Eric Mattsson 15483 Åland Islands AX NaN Progressive Metal/Shred, Hard Rock Active https://www.metal-archives.com/bands/Lars_Eric...
6 Red Room Ensemble 3540393255 Åland Islands AX Mariehamn Melodic Heavy Metal Active https://www.metal-archives.com/bands/Red_Room_...
7 Vorum 3540278890 Åland Islands AX Mariehamn Death Metal Active https://www.metal-archives.com/bands/Vorum/354...
8 Aten 3540381609 Albania AL Tirana, Tirana County Death Metal Active https://www.metal-archives.com/bands/Aten/3540...
9 Centaur 3540334105 Albania AL Tirana, Tirana County Heavy Metal (early), Hard Rock (later) Active https://www.metal-archives.com/bands/Centaur/3...
10 Crossbones 109767 Albania AL Tirana, Tirana County Heavy/Thrash Metal Active https://www.metal-archives.com/bands/Crossbone...
11 Ethernil 3540389667 Albania AL Tirana, Tirana County Death Metal Active https://www.metal-archives.com/bands/Ethernil/...
12 Gverr 3540318477 Albania AL Tirana, Tirana County Progressive Death Metal Active https://www.metal-archives.com/bands/Gverr/354...
13 Nihil 3540325402 Albania AL Tirana, Tirana County Black Metal Active https://www.metal-archives.com/bands/Nihil/354...
14 Pureblood 3540357028 Albania AL NaN Black Metal Split-up https://www.metal-archives.com/bands/Pureblood...
15 Radical Obscurity 3540334106 Albania AL Tirana, Tirana County Death Metal (early), Crossover/Thrash/Groove M... Split-up https://www.metal-archives.com/bands/Radical_O...
16 Shiptarian Darkness 3540426934 Albania AL Durres Raw Black Metal Unknown https://www.metal-archives.com/bands/Shiptaria...
17 Thunder Way 69422 Albania AL Tirana, Tirana County Power/Speed Metal Split-up https://www.metal-archives.com/bands/Thunder_W...
18 Wild Feeling 3540262336 Albania AL Peshkopi, Dibër County Folk/Heavy Metal Split-up https://www.metal-archives.com/bands/Wild_Feel...
19 Azael 14086 Algeria DZ Reghaia, Algiers Province Black Metal Split-up https://www.metal-archives.com/bands/Azael/14086
20 Azar 97016 Algeria DZ Tizi Ouzou, Tizi Ouzou Province Folk/Black Metal Unknown https://www.metal-archives.com/bands/Azar/97016
21 Bactotavaiy 86147 Algeria DZ Guelma, Guelma Province Black Metal Split-up https://www.metal-archives.com/bands/Bactotava...
22 Barbaros 14079 Algeria DZ Algiers, Algiers Province Black Metal Active https://www.metal-archives.com/bands/Barbaros/...
23 Carnavage 36073 Algeria DZ Algiers, Algiers Province Grindcore Split-up https://www.metal-archives.com/bands/Carnavage...
24 Celestial Decay 95405 Algeria DZ Kouba, Algiers Province Black Metal Split-up https://www.metal-archives.com/bands/Celestial...
25 Devast 96684 Algeria DZ Algiers, Algiers Province (early) / Buenos Air... Technical/Brutal Death Metal Active https://www.metal-archives.com/bands/Devast/96684
26 Entropy 12799 Algeria DZ Constantine, Constantine Province Heavy Metal/Hard Rock Split-up https://www.metal-archives.com/bands/Entropy/1...
27 Hellium 120259 Algeria DZ Blida, Blida Province Death/Groove Metal Active https://www.metal-archives.com/bands/Hellium/1...
28 Homicidal 33921 Algeria DZ Algiers, Algiers Province Death Metal/Grindcore Active https://www.metal-archives.com/bands/Homicidal...
29 Lelahell 3540341906 Algeria DZ Algiers, Algiers Province Death Metal Active https://www.metal-archives.com/bands/Lelahell/...
... ... ... ... ... ... ... ... ...
119444 Da Vàng 2002 Vietnam VN Ho Chi Minh City (early), Los Angeles, United ... Progressive Metal Active https://www.metal-archives.com/bands/Da_V%C3%A...
119445 Deep Wound 3540274641 Vietnam VN NaN Black Metal Split-up https://www.metal-archives.com/bands/Deep_Woun...
119446 Disgusted 109162 Vietnam VN Ho Chi Minh City Death Metal On hold https://www.metal-archives.com/bands/Disgusted...
119447 End of Road 82113 Vietnam VN Ho Chi Minh City Death Metal Split-up https://www.metal-archives.com/bands/End_of_Ro...
119448 Final Stage 47865 Vietnam VN Hanoi Melodic Death Metal Split-up https://www.metal-archives.com/bands/Final_Sta...
119449 God Father 87618 Vietnam VN Ho Chi Minh City Heavy Metal Split-up https://www.metal-archives.com/bands/God_Fathe...
119450 Maxich 104398 Vietnam VN Ho Chi Minh City Hard Rock/Progressive Heavy Metal Split-up https://www.metal-archives.com/bands/Maxich/10...
119451 Motorbreath 47412 Vietnam VN NaN Black Metal Split-up https://www.metal-archives.com/bands/Motorbrea...
119452 Ngũ Cung 121049 Vietnam VN Hanoi Progressive Metal/Rock Active https://www.metal-archives.com/bands/Ng%C5%A9_...
119453 Nhiệt Huyết 122230 Vietnam VN Da Nang Thrash Metal Split-up https://www.metal-archives.com/bands/Nhi%E1%BB...
119454 Nostalgia 3540261512 Vietnam VN NaN Black Metal Split-up https://www.metal-archives.com/bands/Nostalgia...
119455 Omerta 3540320447 Vietnam VN Ho Chi Minh City Thrash/Death Metal Split-up https://www.metal-archives.com/bands/Omerta/35...
119456 Rot 28262 Vietnam VN Ho Chi Minh City Raw Black Metal Active https://www.metal-archives.com/bands/Rot/28262
119457 Rotten Fetus 84955 Vietnam VN Gia Lai Death Metal/Grindcore Split-up https://www.metal-archives.com/bands/Rotten_Fe...
119458 SagoMetal 86337 Vietnam VN Ho Chi Minh City Melodic Death Metal/Metalcore Active https://www.metal-archives.com/bands/SagoMetal...
119459 Seismic Origin 3540389582 Vietnam VN Ho Chi Minh City Melodic Death/Groove Metal Split-up https://www.metal-archives.com/bands/Seismic_O...
119460 Selbstmord 114844 Vietnam VN Vietnam (early), United States (later) Black Metal Unknown https://www.metal-archives.com/bands/Selbstmor...
119461 Silent 120672 Vietnam VN Da Nang Melodic Death Metal Split-up https://www.metal-archives.com/bands/Silent/12...
119462 Sói Đen 18310 Vietnam VN Da Nang Folk Metal Split-up https://www.metal-archives.com/bands/S%C3%B3i_...
119463 Sorrow Decadence 87549 Vietnam VN Little SaiGon, California (USA) Melodic Death Metal Split-up https://www.metal-archives.com/bands/Sorrow_De...
119464 Steel Owl 99476 Vietnam VN California, USA Heavy Metal/Hard Rock Split-up https://www.metal-archives.com/bands/Steel_Owl...
119465 Still Rock 3540263814 Vietnam VN Ho Chi Minh City Heavy Metal/Hard Rock Unknown https://www.metal-archives.com/bands/Still_Roc...
119466 Thập Tự Ngược 3540282770 Vietnam VN NaN Black Metal Split-up https://www.metal-archives.com/bands/Th%E1%BA%...
119467 The Light 12893 Vietnam VN Hanoi Thrash Metal Split-up https://www.metal-archives.com/bands/The_Light...
119468 Thuỷ Triều Ðỏ 15789 Vietnam VN Hanoi Progressive Metal Unknown https://www.metal-archives.com/bands/Thu%E1%BB...
119469 Titanium 82114 Vietnam VN Ho Chi Minh City Thrash Metal Changed name https://www.metal-archives.com/bands/Titanium/...
119470 Unlimited 57341 Vietnam VN Ho Chi Minh City Melodic Power/Speed Metal Unknown https://www.metal-archives.com/bands/Unlimited...
119471 Voluptuary 3540386559 Vietnam VN Hanoi Groove/Death Metal Active https://www.metal-archives.com/bands/Voluptuar...
119472 Wừu 3540356050 Vietnam VN Ho Chi Minh City Grindcore Split-up https://www.metal-archives.com/bands/W%E1%BB%A...
119473 Đông Đô 3540316361 Vietnam VN Hanoi Symphonic Power Metal Active https://www.metal-archives.com/bands/%C4%90%C3...

119474 rows × 8 columns

Scrap Population Data


I'm sorry we're still scrapping data. This is it, I promise, then we get to the fun stuff. Here we extract population data from Wikipedia. Just like the part where we scrapped the country's page, we use regexes and BeautifulSoup.

Wikipedia HTML

Here's one row in the population data table for China:

<tr>
 <td>1</td>
 <td style="text-align:left;"><span class="flagicon" style="display:inline-block;width:25px;"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/45px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" width="23"/></span> <a href="/wiki/China" title="China">China</a><sup class="reference" id="cite_ref-5"><a href="#cite_note-5">[Note 2]</a></sup></td>
 <td>1,388,060,000</td>
 <td>December 14, 2017</td>
 <td>18.3%</td>
 <td style="text-align:left;"><a class="external text" href="http://worldpopulationclock.info/china" rel="nofollow">Official population clock</a></td>
</tr>

The country name is hidding in this tag:

<a href="/wiki/(COUNTRY LINK)" title="(COUNTRY)">(COUNTRY)</a>

We do the same thing we did earlier when extracting countries from Encyclopedia Metallum.

The population is in a <td> tag, so we use:

.find_all('td', style = False)

and that will give you this in a list:

<td>1</td>
<td>1,388,060,000</td>
<td>December 14, 2017</td>
<td>18.3%</td>

We simply get the 2nd element and parse the population using regexes.

In [13]:
# HTTP Request
WORLD_POPULATION_URL = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
r = requests.get(WORLD_POPULATION_URL)

# BeautifulSoup & HTML
soup = BeautifulSoup(r.content, 'html5lib')
population_soup = soup.find('table',{'class': 'wikitable sortable'})
world_population = population_soup.find_all('tr')
world_population = world_population[1:] # Remove header row

# Regexes
population_regex = re.compile('<td>([0-9,]+)</td>')
country_regex = re.compile('<a href="/wiki/.*" title="(.*)">.*</a>')
country_redirect_regex = re.compile('<a class="mw-redirect" href="/wiki/.*" title="(.*)">.*</a>')

p_countries = []
population = []

with open('population_data_2017.csv', 'w') as population_csv:
    
    writer =  csv.writer(population_csv)
    writer.writerow(['country', 'population']) # Data Header
    
    # Parse tags
    for i in range(len(world_population)):
        country = world_population[i].find('a', href = True, class_ = False)
        # Territory of other countries, name in redirect link
        redirect = world_population[i].find('a', {'class', 'mw-redirect'})
        population_of_country = world_population[i].find_all('td', style = False)[1]
        
        c = country_regex.match(str(country).replace('\n',' '))
        r = country_redirect_regex.match(str(redirect))
        p = population_regex.match(str(population_of_country))
       
        if redirect != None:
            ctry = r.groups()[0]
        elif c != None:
            ctry = c.groups()[0]
            
        if p != None:
            pop = p.groups()[0].replace(',','')
            
        writer.writerow([ctry,pop])
        
    population_csv.close()

Load In Population CSV

It's a thing of beauty, congratulations! You are now well versed in web scrapping with Python.

In [259]:
with open('population_data_2017.csv', 'r') as population_csv:
    population_data = pd.read_csv(population_csv)
    population_csv.close()
    
population_data.set_index('country', inplace = True)
population_data
Out[259]:
population
country
China 1388060000
India 1325200000
United States 326275000
Indonesia 261890900
Pakistan 209904000
Brazil 208390000
Nigeria 193392500
Bangladesh 163646000
Russia 146854419
Japan 126670000
Mexico 123675351
Philippines 105021000
Egypt 96186400
Ethiopia 94352000
Vietnam 93700000
Germany 82800000
Democratic Republic of the Congo 81339900
Iran 81039400
Turkey 79814871
France 67182000
Thailand 66061000
United Kingdom 65648000
Italy 60504233
South Africa 56717000
Myanmar 53370609
Tanzania 51557365
South Korea 51446201
Kenya 49699862
Colombia 49538200
Spain 46549045
... ...
Marshall Islands 55000
South Ossetia 53532
Faroe Islands 50358
Saint Kitts and Nevis 46204
Sint Maarten 39410
Turks and Caicos Islands 37910
Liechtenstein 37815
Monaco 37550
Saint-Martin 36457
San Marino 33265
Gibraltar 33140
British Virgin Islands 28514
Cook Islands 18100
Palau 17800
Anguilla 13452
Wallis and Futuna 11800
Nauru 10800
Tuvalu 10100
Saint Barthélemy 9417
Saint Pierre and Miquelon 6286
Saint Helena, Ascension and Tristan da Cunha 5633
Montserrat 4922
Falkland Islands 2563
Norfolk Island 2302
Christmas Island 2072
Niue 1470
Tokelau 1411
Vatican City 800
Cocos (Keeling) Islands 550
Pitcairn Islands 57

240 rows × 1 columns

In [1]:
!pip install folium
import folium
import matplotlib.pyplot as plt
import seaborn as sns
Collecting folium
  Downloading folium-0.5.0.tar.gz (79kB)
    100% |████████████████████████████████| 81kB 1.1MB/s ta 0:00:01    64% |████████████████████▊           | 51kB 1.3MB/s eta 0:00:01
Collecting branca (from folium)
  Downloading branca-0.2.0-py3-none-any.whl
Requirement already satisfied: jinja2 in /opt/conda/lib/python3.6/site-packages (from folium)
Requirement already satisfied: requests in /opt/conda/lib/python3.6/site-packages (from folium)
Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from folium)
Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.6/site-packages (from jinja2->folium)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests->folium)
Requirement already satisfied: idna<2.7,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests->folium)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests->folium)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests->folium)
Building wheels for collected packages: folium
  Running setup.py bdist_wheel for folium ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/04/d0/a0/b2b8356443364ae79743fce0b9b6a5b045f7560742129fde22
Successfully built folium
Installing collected packages: branca, folium
Successfully installed branca-0.2.0 folium-0.5.0

Genre Analysis


Answers Question #1:

Tranform DataFrame for Genre Analysis

Our metal bands data has a genre column, but right now it isn't in a form where we can do any genre specific analysis. We need to determine the big major metal genre's and show whether the band belongs to any of these major groups. Encyclopedia Metallum has us covered with the list above. Metal genres usually have an umbrella term and anything attached to them are subgenres, like Melodic Death Metal for example. To do this task, were going to use regexes again and put the major umbrella terms as the keywords. Unlike regular regex matching, you want to use:

regex = re.compile(genre_str)
regex.search(band_genre)

Search, goes through the entire string instead of matching from the beginning only, very similar to most find tools to search for keywords. We also want to keep track of number of genre types each band is classified as. This will allow us to see whether the genre type is mostly pure or mixed with other genres.

If you're interested in learning more about metal genres, you can check out Patrick Galbraith and Nick Grant's Interactive Map of Metal. The map allows you to see the evolution of metal music.

In [463]:
genre_regex_str = ['Black', 'Death', 'Doom|Stoner|Sludge', 
                   'Electronic|Industrial', 'Experimental|Avant-garde', 
                   'Folk|Viking|Pagan', 'Gothic', 'Grindcore|Goregrind', 
                   'Groove', 'Hard Rock', 'Heavy', 'Metalcore|Deathcore', 
                   'Power', 'Progressive', 'Speed', 'Symphonic', 'Thrash']

genre_regexes = []
genre_count = {}

# Create genre keyword regexes
for g in range(len(genre_regex_str)):
    genre_count[genre_regex_str[g]] = []
    genre_regexes.append(re.compile(genre_regex_str[g]))

# For every band, determine if they belong to any of the major metal genre types
# Also count how many genre types each band belongs to
genre_mix = []
for idx,band in metal_data.iterrows():
    mix = 0
    
    for idx in range(len(genre_regexes)):
        genre_group = genre_regex_str[idx]
        if genre_regexes[idx].search(band['genres']) != None:
            genre_count[genre_group].append(True)
            mix = mix + 1
        else:
            genre_count[genre_group].append(False)
    
    genre_mix.append(mix)

# Drop other columns not relevant to genre analysis
genre_data = metal_data.copy(deep = True)
genre_data = genre_data.drop(['band_id', 'location', 'website'], 1)
genre_data.insert(5, 'genre groups', genre_mix)

# Insert each genre column into original DataFrame
for idx in range(len(genre_regex_str)):
    genre = genre_regex_str[idx]
    genre_data.insert((idx + 6), genre, genre_count[genre])
    
genre_data
Out[463]:
band_name country CID genres status genre groups Black Death Doom|Stoner|Sludge Electronic|Industrial ... Grindcore|Goregrind Groove Hard Rock Heavy Metalcore|Deathcore Power Progressive Speed Symphonic Thrash
0 District Unknown Afghanistan AF Progressive Groove Metal Active 2 False False False False ... False True False False False False True False False False
1 Contamination of Expressions Åland Islands AX Doom/Heavy Metal Split-up 2 False False True False ... False False False True False False False False False False
2 Eternity Åland Islands AX Heavy Metal Split-up 1 False False False False ... False False False True False False False False False False
3 Haudankaivaja Åland Islands AX Death Metal Changed name 1 False True False False ... False False False False False False False False False False
4 Hiisi Åland Islands AX Folk/Black Metal Split-up 2 True False False False ... False False False False False False False False False False
5 Lars Eric Mattsson Åland Islands AX Progressive Metal/Shred, Hard Rock Active 2 False False False False ... False False True False False False True False False False
6 Red Room Ensemble Åland Islands AX Melodic Heavy Metal Active 1 False False False False ... False False False True False False False False False False
7 Vorum Åland Islands AX Death Metal Active 1 False True False False ... False False False False False False False False False False
8 Aten Albania AL Death Metal Active 1 False True False False ... False False False False False False False False False False
9 Centaur Albania AL Heavy Metal (early), Hard Rock (later) Active 2 False False False False ... False False True True False False False False False False
10 Crossbones Albania AL Heavy/Thrash Metal Active 2 False False False False ... False False False True False False False False False True
11 Ethernil Albania AL Death Metal Active 1 False True False False ... False False False False False False False False False False
12 Gverr Albania AL Progressive Death Metal Active 2 False True False False ... False False False False False False True False False False
13 Nihil Albania AL Black Metal Active 1 True False False False ... False False False False False False False False False False
14 Pureblood Albania AL Black Metal Split-up 1 True False False False ... False False False False False False False False False False
15 Radical Obscurity Albania AL Death Metal (early), Crossover/Thrash/Groove M... Split-up 3 False True False False ... False True False False False False False False False True
16 Shiptarian Darkness Albania AL Raw Black Metal Unknown 1 True False False False ... False False False False False False False False False False
17 Thunder Way Albania AL Power/Speed Metal Split-up 2 False False False False ... False False False False False True False True False False
18 Wild Feeling Albania AL Folk/Heavy Metal Split-up 2 False False False False ... False False False True False False False False False False
19 Azael Algeria DZ Black Metal Split-up 1 True False False False ... False False False False False False False False False False
20 Azar Algeria DZ Folk/Black Metal Unknown 2 True False False False ... False False False False False False False False False False
21 Bactotavaiy Algeria DZ Black Metal Split-up 1 True False False False ... False False False False False False False False False False
22 Barbaros Algeria DZ Black Metal Active 1 True False False False ... False False False False False False False False False False
23 Carnavage Algeria DZ Grindcore Split-up 1 False False False False ... True False False False False False False False False False
24 Celestial Decay Algeria DZ Black Metal Split-up 1 True False False False ... False False False False False False False False False False
25 Devast Algeria DZ Technical/Brutal Death Metal Active 1 False True False False ... False False False False False False False False False False
26 Entropy Algeria DZ Heavy Metal/Hard Rock Split-up 2 False False False False ... False False True True False False False False False False
27 Hellium Algeria DZ Death/Groove Metal Active 2 False True False False ... False True False False False False False False False False
28 Homicidal Algeria DZ Death Metal/Grindcore Active 2 False True False False ... True False False False False False False False False False
29 Lelahell Algeria DZ Death Metal Active 1 False True False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
119444 Da Vàng Vietnam VN Progressive Metal Active 1 False False False False ... False False False False False False True False False False
119445 Deep Wound Vietnam VN Black Metal Split-up 1 True False False False ... False False False False False False False False False False
119446 Disgusted Vietnam VN Death Metal On hold 1 False True False False ... False False False False False False False False False False
119447 End of Road Vietnam VN Death Metal Split-up 1 False True False False ... False False False False False False False False False False
119448 Final Stage Vietnam VN Melodic Death Metal Split-up 1 False True False False ... False False False False False False False False False False
119449 God Father Vietnam VN Heavy Metal Split-up 1 False False False False ... False False False True False False False False False False
119450 Maxich Vietnam VN Hard Rock/Progressive Heavy Metal Split-up 3 False False False False ... False False True True False False True False False False
119451 Motorbreath Vietnam VN Black Metal Split-up 1 True False False False ... False False False False False False False False False False
119452 Ngũ Cung Vietnam VN Progressive Metal/Rock Active 1 False False False False ... False False False False False False True False False False
119453 Nhiệt Huyết Vietnam VN Thrash Metal Split-up 1 False False False False ... False False False False False False False False False True
119454 Nostalgia Vietnam VN Black Metal Split-up 1 True False False False ... False False False False False False False False False False
119455 Omerta Vietnam VN Thrash/Death Metal Split-up 2 False True False False ... False False False False False False False False False True
119456 Rot Vietnam VN Raw Black Metal Active 1 True False False False ... False False False False False False False False False False
119457 Rotten Fetus Vietnam VN Death Metal/Grindcore Split-up 2 False True False False ... True False False False False False False False False False
119458 SagoMetal Vietnam VN Melodic Death Metal/Metalcore Active 2 False True False False ... False False False False True False False False False False
119459 Seismic Origin Vietnam VN Melodic Death/Groove Metal Split-up 2 False True False False ... False True False False False False False False False False
119460 Selbstmord Vietnam VN Black Metal Unknown 1 True False False False ... False False False False False False False False False False
119461 Silent Vietnam VN Melodic Death Metal Split-up 1 False True False False ... False False False False False False False False False False
119462 Sói Đen Vietnam VN Folk Metal Split-up 1 False False False False ... False False False False False False False False False False
119463 Sorrow Decadence Vietnam VN Melodic Death Metal Split-up 1 False True False False ... False False False False False False False False False False
119464 Steel Owl Vietnam VN Heavy Metal/Hard Rock Split-up 2 False False False False ... False False True True False False False False False False
119465 Still Rock Vietnam VN Heavy Metal/Hard Rock Unknown 2 False False False False ... False False True True False False False False False False
119466 Thập Tự Ngược Vietnam VN Black Metal Split-up 1 True False False False ... False False False False False False False False False False
119467 The Light Vietnam VN Thrash Metal Split-up 1 False False False False ... False False False False False False False False False True
119468 Thuỷ Triều Ðỏ Vietnam VN Progressive Metal Unknown 1 False False False False ... False False False False False False True False False False
119469 Titanium Vietnam VN Thrash Metal Changed name 1 False False False False ... False False False False False False False False False True
119470 Unlimited Vietnam VN Melodic Power/Speed Metal Unknown 2 False False False False ... False False False False False True False True False False
119471 Voluptuary Vietnam VN Groove/Death Metal Active 2 False True False False ... False True False False False False False False False False
119472 Wừu Vietnam VN Grindcore Split-up 1 False False False False ... True False False False False False False False False False
119473 Đông Đô Vietnam VN Symphonic Power Metal Active 2 False False False False ... False False False False False True False False True False

119474 rows × 23 columns

Overall Genre Counts

After the DataFrame transformation, we can easily do band counts for each genre. Use .value_counts() on the column you want to tally instead of going through the entire DataFrame with a for loop. The DataFrame is a nice presentation of the data, especially if you sort using:

.sort_values(column_name, ascending = False)

we can easily see which genres are the most popular, however there are some powerful graphing capabilities we can use to visualize the data instead of plain numbers.

In [343]:
genre_total_data = genre_data.copy(deep = True)
genre_count = []
genre_categories = list(genre_total_data.columns)[6:]
for idx in range(len(genre_categories)):
    counts = genre_total_data[genre_categories[idx]].value_counts()
    if True in counts:
        genre_count.append(counts[True])
    else:
        genre_count.append(0)

g = {'genre': genre_categories, 'band count': genre_count}
genre_total_df = pd.DataFrame(data = g, index = g['genre'], columns = ['band count'])
genre_total_df.sort_values('band count', ascending = False)
Out[343]:
band count
Death 40814
Black 31541
Thrash 25267
Heavy 17209
Doom|Stoner|Sludge 11590
Progressive 8191
Power 7327
Grindcore|Goregrind 4885
Metalcore|Deathcore 4515
Groove 4367
Gothic 3486
Folk|Viking|Pagan 2823
Symphonic 2603
Speed 2225
Hard Rock 2209
Experimental|Avant-garde 1175
Electronic|Industrial 1161

Genre Count Bar Graph

For beautiful and easy to use plotting, import matplotlib and seaborn. You can use matplotlib by itself, but seaborn is much more flexible and makes fine tuning how a graph looks much easier.

import matplotlib.pyplot as plt
import seaborn as sns

If you want to look at color options for plots here are some useful links:

Here we are plotting the DataFrame we created above and we can answer one of our analysis questions.

[ANSWER TO Q1A]: Which genre type seems to dominate the metal playing field in general?

Black, Death, Trash, Heavy, and Doom/Stoner/Sludge are the top 5 genres types in metal music. Which is not surprising if you're familiar with metal history, lower count genres tend to be newer genres where as the top 5 are ancestors to many of the other genre types and subgenres. It's pretty evident if you go look at the Map of Metal mentioned earlier.

In [344]:
sns.plt.title('Genre Classification Counts For Metal Bands') # Set graph title
genre_plot = sns.barplot(x = list(range(0,len(genre_total_df))), 
                 y = genre_total_df['band count'], 
                 palette = sns.color_palette("cubehelix", len(genre_total_df)))

genre_plot.set(xlabel = 'Genre Classification', 
       ylabel = 'Band Count', 
       xticklabels = list(genre_total_df.index)) # Set x-axis labels since python can't plot strings as values

sns.plt.xticks(rotation = 90) # Makes x-axis labels vertical
genre_plot.set_facecolor('white') # Changes tthe background to white
plt.show()

Country Genre Counts

So we know the top overall metals genres, but let's look at what genres dominate each country. Instead of using .value_counts() let's use pivot_table from Pandas. The pivot_table allows us to tally the bands with a genre type in a specific country.

In [379]:
genre_by_country = pd.pivot_table(data = genre_data,
                                  index = ['country', 'CID'], # Combines all rows with the same values for these columns
                                  values = genre_regex_str, # Give pandas the columns you want the aggregate function to apply
                                  aggfunc = np.sum) # Specify what type of function, default is 

genre_by_country.head()
Out[379]:
Black Death Doom|Stoner|Sludge Electronic|Industrial Experimental|Avant-garde Folk|Viking|Pagan Gothic Grindcore|Goregrind Groove Hard Rock Heavy Metalcore|Deathcore Power Progressive Speed Symphonic Thrash
country CID
Afghanistan AF 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
Albania AL 3.0 4.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 3.0 0.0 1.0 1.0 1.0 0.0 2.0
Algeria DZ 12.0 7.0 2.0 0.0 0.0 2.0 0.0 2.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
Andorra AD 0.0 2.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 2.0 0.0 0.0 0.0
Angola AO 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0

Genre by Country Plots

Let's plot the DataFrame from above to see the results. Since I'll be plotting 17 genre plots, let's make it a function so that we don't have to repeat code 17 times.

In [374]:
def genre_bargraph(df, title, ylabel, genre, palette):
    plt.figure(figsize = (26,3)) # Change size of plot in inches
    sns.plt.title(title)
    ax = sns.barplot(x = list(df.index.labels[0]), 
                     y = df[genre], 
                     palette = palette)

    ax.set(xlabel = 'Country', 
           ylabel = ylabel, 
           xticklabels = list(df.index.levels[0]))

    sns.plt.xticks(rotation = 90)
    ax.set_facecolor('white')

15/17 plots shows that the United States has a lot of bands in all 17 genres. However this doesn not help us answer the question of what genres dominate each country. So we have to take the band counts and divide by the total number of bands in that country.

In [402]:
genre_lst = list(genre_by_country.columns)
l = len(genre_by_country)

titles = ['Black Metal','Death Metal','Doom Metal','Electronic/Industrial Metal',
          'Experimental/Avant-Garde Metal','Folk/Viking/Pagan Metal','Gothic Metal', 
          'Grindcore/Goregrind','Groove Metal','Hard Rock','Heavy Metal','Metalcore/Deathcore',
          'Power Metal','Progressive Metal','Speed Metal','Symphonic Metal','Thrash Metal']

for idx in range(len(genre_lst)):
    genre_bargraph(genre_by_country, 
                   'Band Count of ' + titles[idx] + ' by Country',
                   'Band Count',
                   genre_lst[idx], 
                   sns.cubehelix_palette(l, start = 0.5, rot = -0.75))
    
plt.show()

First We need to calculate how many bands are in each country.

In [395]:
band_count = pd.DataFrame(metal_data['country'].value_counts())
band_count = band_count.sort_index()
band_count.columns = ['band count']
band_count.head()
Out[395]:
band count
Afghanistan 1
Albania 11
Algeria 23
Andorra 4
Angola 4

Now we can insert that into our original DataFrame and let's use NumPy to our advantage to calculate genre percentage. If you have two NumPy arrays of the same length, you can directly use:

a = np.array(dataA, dtype = np.float)
b = np.array(dataB, dtype = np.float)
list(a / b)

We do this for every genre and we're done!

In [446]:
genre_percentage_by_country = genre_by_country.copy(deep = True)
genre_percentage_by_country.insert(0, 'band count', list(band_count['band count'])) # Insert total band counts into DataFrame
lst = list(genre_percentage_by_country.columns)[1:]
for idx in range(len(lst)):
    curr_genre = genre_percentage_by_country[lst[idx]]
    genre_arr = np.array(curr_genre, dtype = np.float)
    total_arr = np.array(genre_percentage_by_country['band count'], dtype = np.float)
    genre_percentage_by_country[lst[idx]] = list(genre_arr / total_arr)
    
genre_percentage_by_country.head()
Out[446]:
band count Black Death Doom|Stoner|Sludge Electronic|Industrial Experimental|Avant-garde Folk|Viking|Pagan Gothic Grindcore|Goregrind Groove Hard Rock Heavy Metalcore|Deathcore Power Progressive Speed Symphonic Thrash
country CID
Afghanistan AF 1 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.000000 1.000000 0.000000 0.000000 0.00 0.000000 1.000000 0.000000 0.0 0.000000
Albania AL 11 0.272727 0.363636 0.000000 0.0 0.0 0.090909 0.0 0.000000 0.090909 0.090909 0.272727 0.00 0.090909 0.090909 0.090909 0.0 0.181818
Algeria DZ 23 0.521739 0.304348 0.086957 0.0 0.0 0.086957 0.0 0.086957 0.043478 0.043478 0.043478 0.00 0.000000 0.043478 0.000000 0.0 0.043478
Andorra AD 4 0.000000 0.500000 0.250000 0.0 0.0 0.000000 0.0 0.250000 0.000000 0.000000 0.000000 0.00 0.250000 0.500000 0.000000 0.0 0.000000
Angola AO 4 0.250000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.000000 0.250000 0.25 0.000000 0.000000 0.000000 0.0 0.250000

Let's regraph you can see for each country what genres are really prevalent in each country.

[ANSWER TO Q1B]: Which genre type seems to dominate the metal playing field in every country?

The answer is actually all of our 17 graphs! Let's look at one as an example: If the bars for that country is high, then this type of genre is a very popular genre for this specific country.

*You can double click on the graphs to enlarge them.

In [447]:
genre_lst = list(genre_percentage_by_country.columns)[1:]
l = len(genre_percentage_by_country)
for idx in range(len(genre_lst)):
    genre_bargraph(genre_percentage_by_country, 
                   'Percentage of ' + titles[idx] + ' by Country',
                   'Percentage',
                   genre_lst[idx], 
                   sns.cubehelix_palette(l, start = 1.5, rot = -0.75, dark = 0, light = 0.95))
    
plt.show()

[ANSWER TO Q1C]: Are most genres mixed or pure?

Looking at the graph we made more than 70,000 bands are pure one type of genre while around 40,000 bands are mix of 2 different genre types. Around 61% are pure and 34% are mix of 2, and around 40% are a mix of 2 or more genre types. I'm actually surprised since most bands are usually listed with so many genre tags.

In [489]:
purity_counts = pd.DataFrame(genre_purity_data['genre groups'].value_counts())
purity_counts
Out[489]:
genre groups
1 72692
2 40039
3 5662
0 678
4 384
5 18
6 1
In [487]:
sns.plt.title('Mixed Metal Genre Counts') # Set graph title
genre_plot = sns.barplot(x = list(purity_counts.index), 
                         y = purity_counts['genre groups'], 
                         palette = sns.color_palette("cubehelix", len(purity_counts) * 2))

genre_plot.set(xlabel = 'Number of Genre Types', ylabel = 'Band Count') # Set x-axis and y-axis label names

genre_plot.set_facecolor('white') # Changes tthe background to white
plt.show()

Top Metal Countries


Answers Question #2:

Let's apply all of the python techiques to create a DataFrame for the status types for every country.

In [470]:
c = list(band_count.index)
status_types = list(overall_status_counts.index)
status_count_data = {}
status_count_data['Total'] = list(band_count['band count'])
status_count_data['Active'] = []
status_count_data['Split-up'] = []
status_count_data['Changed name'] = []
status_count_data['On hold'] = []
status_count_data['Unknown'] = []
status_count_data['Disputed'] = []

for idx in range(len(c)):
    status_counts = metal_data[metal_data['country'] == c[idx]]['status'].value_counts()
    for st in range(len(status_types)):
        if status_types[st] in status_counts:
            status_count_data[status_types[st]].append(status_counts[status_types[st]])
        else:
            status_count_data[status_types[st]].append(0)
            
status_data = pd.DataFrame(data = status_count_data, index = c, columns = ['Total'] + status_types)
status_data.head()
Out[470]:
Total Active Split-up Unknown Changed name On hold Disputed
Afghanistan 1 1 0 0 0 0 0
Albania 11 6 4 1 0 0 0
Algeria 23 10 10 2 0 1 0
Andorra 4 3 1 0 0 0 0
Angola 4 4 0 0 0 0 0

We need to insert the population data we scrapped earlier into the new DataFrame, we have to merge the two data sets together. Since there are some countries with different names in the data set, so I created an alias dictionary to solve the problem.

In [471]:
alias = {'Korea, South': 'South Korea',
         'Macedonia (FYROM)': 'Republic of Macedonia', 
         'Georgia': 'Georgia (country)',
         'Ireland': 'Republic of Ireland'}
pop_dataframe_countries = list(population_data.index)
countries = list(status_data.index)

population_info = []
for c in range(len(countries)):
    
    country = countries[c]
    if country in alias:
        country = alias[country]
    if country in pop_dataframe_countries:
        country_population = population_data.loc[country]['population']
        population_info.append(country_population)
    else:
        population_info.append(0)

status_data.insert(0, 'Population (2017)', population_info)
status_data = status_data[status_data['Population (2017)'] != 0] # Remove countries with no population match
status_data.head()
Out[471]:
Population (2017) Total Active Split-up Unknown Changed name On hold Disputed
Afghanistan 29724323 1 1 0 0 0 0 0
Albania 2876591 11 6 4 1 0 0 0
Algeria 41697498 23 10 10 2 0 1 0
Andorra 78264 4 3 1 0 0 0 0
Angola 28359634 4 4 0 0 0 0 0

Calculating Per Capita

I wanted to calculate 3 different per capitas, one for overall, one for currently active bands, and one for bands that have split-up. This will tell us which countries have the most metal music activity in proportion to population.

The equation for calculating Per Capita (# of bands per person):

$\text{Per Capita} = \dfrac{\text{band count}}{\text{population}} \times \text{(scalar)}$

The scalar let's you control the proportion, if you want the original definition of per capita, the scalar is 1 for each person in the population. I'm going to do per every ten thousand individuals so my scalar is 10,000.

In [472]:
total_arr = np.array(list(status_data['Total']), dtype = np.float)
active_arr = np.array(list(status_data['Active']), dtype = np.float)
split_arr = np.array(list(status_data['Split-up']), dtype = np.float)
population_arr = np.array(list(status_data['Population (2017)']), dtype = np.float)

status_data.insert(2, 'Total Per Capita (10,000)', list((total_arr / population_arr) * 10000))
status_data.insert(4, 'Active Per Capita (10,000)', list((active_arr / population_arr) * 10000))
status_data.insert(6, 'Split-up Per Capita (10,000)', list((split_arr / population_arr) * 10000))
status_data.head()
Out[472]:
Population (2017) Total Total Per Capita (10,000) Active Active Per Capita (10,000) Split-up Split-up Per Capita (10,000) Unknown Changed name On hold Disputed
Afghanistan 29724323 1 0.000336 1 0.000336 0 0.000000 0 0 0 0
Albania 2876591 11 0.038240 6 0.020858 4 0.013905 1 0 0 0
Algeria 41697498 23 0.005516 10 0.002398 10 0.002398 2 0 1 0
Andorra 78264 4 0.511091 3 0.383318 1 0.127773 0 0 0 0
Angola 28359634 4 0.001410 4 0.001410 0 0.000000 0 0 0 0

Metal Choropleth Map

Here, we're going to use folium make an interactive choropleth map, it'll let us see which country is the most metal better than a bar graph. I was inspired by this Metal Injection Blog Post.

!pip install folium
import folium

The GEOJSON file of the world is located here

It isn't exactly difficult if your data matches with the GEOJSON file exactly, but in our case, the file has information for a lot more countries than there are in the DataFrame. The subsequent 2 cells contain code that modify the DataFrame to include all the countries in the GEOJSON file and for countries that were not in our DataFrame, we have to give them filler values.

In [473]:
choropleth_data = status_data.copy(deep = True)
alias = {'Serbia': 'Republic of Serbia',
         'Macedonia (FYROM)': 'Macedonia',
         'Korea, South': 'South Korea',
         'United States': 'United States of America'}

# Going through the JSON file to extract all the countries
with open('countries.geo.json', 'r') as country_boundaries_geojson:
    country_boundaries = json.load(country_boundaries_geojson)
    cb = [country_boundaries['features'][i]['properties']['ADMIN']for i in range(len(country_boundaries['features']))]
    country_boundaries_geojson.close()
    
l = list(choropleth_data.index)
geojson_country_name = []

# Making a column of the country name from the JSON file
# In order to make the map, all country names in DataFrame must
# Match the names in the JSON file.
for i in range(len(l)):
    if l[i] in alias:
        geojson_country_name.append(alias[l[i]])
    elif l[i] in cb:
        geojson_country_name.append(l[i])
        
choropleth_data.insert(0, 'geojson_country_name', geojson_country_name)
choropleth_data = choropleth_data.drop(['Unknown','Changed name','On hold','Disputed'], 1)
choropleth_data.head()
Out[473]:
geojson_country_name Population (2017) Total Total Per Capita (10,000) Active Active Per Capita (10,000) Split-up Split-up Per Capita (10,000)
Afghanistan Afghanistan 29724323 1 0.000336 1 0.000336 0 0.000000
Albania Albania 2876591 11 0.038240 6 0.020858 4 0.013905
Algeria Algeria 41697498 23 0.005516 10 0.002398 10 0.002398
Andorra Andorra 78264 4 0.511091 3 0.383318 1 0.127773
Angola Angola 28359634 4 0.001410 4 0.001410 0 0.000000
In [474]:
# Collect all the countries not matched from the JSON file
no_match = [cb[c] for c in range(len(cb)) if cb[c] not in geojson_country_name]

# Creating a DataFrame for the non-matched countries
# and giving them filler values
filler = [-1] * (len(no_match))
filler_data = {}
choropleth_col = list(choropleth_data.columns)
for idx in range(len(choropleth_data.columns)):
    h = choropleth_col[idx]
    if h == 'country':
        filler_data[h] = [''] * (len(no_match))
    elif h == 'CID':
        filler_data[h] = ['--'] * (len(no_match))
    elif h == 'geojson_country_name':
        filler_data[h] = no_match
    else:
        filler_data[h] = filler

# Combine original DataFrame with non-match DataFrame
geojson_no_matches = pd.DataFrame(data = filler_data,columns = choropleth_col)
choropleth_data = choropleth_data.append(geojson_no_matches)
choropleth_data
Out[474]:
geojson_country_name Population (2017) Total Total Per Capita (10,000) Active Active Per Capita (10,000) Split-up Split-up Per Capita (10,000)
Afghanistan Afghanistan 29724323 1 0.000336 1 0.000336 0 0.000000
Albania Albania 2876591 11 0.038240 6 0.020858 4 0.013905
Algeria Algeria 41697498 23 0.005516 10 0.002398 10 0.002398
Andorra Andorra 78264 4 0.511091 3 0.383318 1 0.127773
Angola Angola 28359634 4 0.001410 4 0.001410 0 0.000000
Argentina Argentina 44044811 2016 0.457716 1326 0.301057 481 0.109207
Armenia Armenia 2979600 18 0.060411 14 0.046986 4 0.013425
Aruba Aruba 110882 1 0.090186 0 0.000000 0 0.000000
Australia Australia 24746700 2382 0.962553 1350 0.545527 717 0.289736
Austria Austria 8817514 1125 1.275870 564 0.639636 407 0.461581
Azerbaijan Azerbaijan 9867250 8 0.008108 6 0.006081 2 0.002027
Bahrain Bahrain 1451200 12 0.082690 6 0.041345 6 0.041345
Bangladesh Bangladesh 163646000 65 0.003972 56 0.003422 7 0.000428
Barbados Barbados 285719 2 0.069999 2 0.069999 0 0.000000
Belarus Belarus 9495800 375 0.394911 247 0.260115 90 0.094779
Belgium Belgium 11385383 1196 1.050470 535 0.469901 546 0.479562
Belize Belize 380010 6 0.157891 4 0.105260 1 0.026315
Bolivia Bolivia 11145770 245 0.219814 181 0.162393 39 0.034991
Bosnia and Herzegovina Bosnia and Herzegovina 3518000 126 0.358158 78 0.221717 34 0.096646
Botswana Botswana 2230905 9 0.040342 8 0.035860 1 0.004482
Brazil Brazil 208390000 5622 0.269783 3413 0.163779 1607 0.077115
Brunei Brunei 417200 20 0.479386 9 0.215724 8 0.191755
Bulgaria Bulgaria 7101859 381 0.536479 180 0.253455 170 0.239374
Cambodia Cambodia 15848495 2 0.001262 2 0.001262 0 0.000000
Canada Canada 36746700 3752 1.021044 2022 0.550253 1294 0.352140
Chile Chile 17373831 1999 1.150581 1483 0.853583 333 0.191668
China China 1388060000 300 0.002161 241 0.001736 43 0.000310
Colombia Colombia 49538200 1432 0.289070 1111 0.224271 204 0.041180
Costa Rica Costa Rica 4947490 217 0.438606 165 0.333502 32 0.064679
Croatia Croatia 4154213 405 0.974914 204 0.491068 136 0.327378
... ... ... ... ... ... ... ... ...
82 Saint Helena -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
83 Solomon Islands -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
84 Sierra Leone -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
85 Somaliland -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
86 Somalia -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
87 Saint Pierre and Miquelon -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
88 Sao Tome and Principe -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
89 Swaziland -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
90 Sint Maarten -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
91 Seychelles -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
92 Turks and Caicos Islands -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
93 Chad -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
94 Togo -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
95 East Timor -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
96 Tonga -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
97 Tuvalu -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
98 United Republic of Tanzania -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
99 United States Minor Outlying Islands -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
100 US Naval Base Guantanamo Bay -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
101 Vatican -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
102 Saint Vincent and the Grenadines -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
103 British Virgin Islands -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
104 United States Virgin Islands -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
105 Vanuatu -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
106 Wallis and Futuna -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
107 Akrotiri Sovereign Base Area -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
108 Samoa -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
109 Yemen -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
110 Zambia -1 -1 -1.000000 -1 -1.000000 -1 -1.000000
111 Zimbabwe -1 -1 -1.000000 -1 -1.000000 -1 -1.000000

255 rows × 8 columns

folium has a built in choropleth function. Because I have to graph 3 of them, I made a function and I had them all added in as layers. Because the map is so tasking on the computer, saving it as an html file is a good option if your computer can't handle the large amount of graphics.

View the full map by downloading it here

With the map layers, we answer our 2nd question.

[ANSWER TO Q2A-C]: Which country is the most metal?

Finland and those Scandinavians! Finland is the top for all 3 maps and that shows that metal music activity in Finland is extremely high even though they are not the one of the most population dense countries in the world.

Here are some resources for choropleth maps:

  1. Color Brewer
  2. Quickstart Example Code
  3. Folium Documentation
In [485]:
# Function to add choropleth overlay to base map
def build_choropleth(base_map, name, field, threshold, color):
    base_map.choropleth(name = name + 'Metal Bands Per Capita (10,000)',
                geo_data = 'countries.geo.json', # JSON file containing the country boundary lines
                data = choropleth_data, # DataFrame to plot from
                columns = ['geojson_country_name', field + ' Per Capita (10,000)'], # Which columns to plot
                key_on = 'feature.properties.ADMIN', # What property in the JSON file to match the column data with
                threshold_scale = threshold, # Adjusts legend color scale
                fill_color = color, fill_opacity = 0.5, line_opacity = 0.5, # Choropleth color overlay options
                legend_name = name + 'Metal Bands Per Capita (10,000)',
                reset = True, highlight = True)
    
base_map = folium.Map(location = [0,0], zoom_start = 1.5)
build_choropleth(base_map, '', 'Total', [0,1,2,5,6,7], 'BuPu')
build_choropleth(base_map, 'Active', 'Active', [-1,0,1,2,3,4], 'BuGn')
build_choropleth(base_map, 'Split-up', 'Split-up', [-1.0, -0.1, 0.4, 0.9, 1.4, 1.9], 'YlOrRd')

folium.LayerControl().add_to(base_map)
base_map.save('metalband_choropleth_map.html') # Save as html because most computers can't handle the loading the map

Conculsion & Major Souces


If you have stuck with me till the end, then congratulations! We did some incredible things in Python 3 and answered these metal music questions:

  1. Which genre type seems to dominate the metal playing field?
  2.  
  3. Which country is the most metal?

In comparison with the kaggle notebook, we had pretty similar results even though the data set here is 20 times larger. Our most metal country was Finland, but for the kaggle analysis it was the Faroes Islands, which is actually an island off of the Scandinavian paninsulas. A lot of the genre analysis was similar as well.

Thanks for reading and following along!

Sources