
Which computers will we use to crawl? #69

Closed
katehausladen opened this issue Oct 8, 2023 · 5 comments
Assignees
Labels
infrastructure An issue related to underlying compute or selecting technologies

Comments

@katehausladen
Collaborator

katehausladen commented Oct 8, 2023

Back in May in issue #37, we discussed which computers we would use for our crawl but never really came to a concrete conclusion:

We decided in today's call to first test whether adding more time would improve the performance of the Mac Minis. If that does not work, we may use everyone's laptops for the crawl, or we may buy newer dedicated crawl Macs.

Since crawling 10,000 sites will take multiple days to complete, I think it would be best if we did not use our own computers. Now would be a good time to decide whether we should get new Mac minis or whether the current ones will work.

Some things to consider would be:

  1. Success rate (i.e., how many sites are incorrectly analyzed?)
  2. Crash rate (i.e., how many sites fail to be analyzed due to Selenium/Firefox errors?)
  3. Site loading time (i.e., do the Mac minis just need extra time to load sites, or do they fail to load resource-intensive sites in Selenium at all?)
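Once a test crawl finishes, metrics 1 and 2 above could be tallied from its logs. A minimal sketch, assuming error-logging.json holds a JSON array of error entries that each carry a "site" field (the real log format may differ):

```python
import json

def crawl_metrics(error_log_path, total_sites):
    """Tally crash and success rates from a crawl's error log.

    Assumes error-logging.json is a JSON array of error entries,
    each with a "site" field -- adjust to the actual log format.
    """
    with open(error_log_path) as f:
        errors = json.load(f)
    # A site that logged any error counts as a crash; duplicates collapse.
    crashed_sites = {entry["site"] for entry in errors}
    crash_rate = len(crashed_sites) / total_sites
    return {"crash_rate": crash_rate, "success_rate": 1 - crash_rate}
```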

@Jocelyn0830, if you could do a small crawl on one of the Mac minis (assuming Professor Danner is not actively using both of them), that would be great. You can use this validation set (sites + ground truths). You can either run the crawl on your Mac to compare or just compare the Mac mini results to the results I got on my Mac. The run I did used the VPN, took 1,601 seconds, and had no errors logged in error-logging.json. This Google Colab will help with the comparison.

If Oliver hasn't merged the issue-60 branch by the time you get to this, run the crawl from the issue-60 branch. The SQL database creation command is in the PR.
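For metric 1, the crawl output can be diffed against the ground truths once both are exported as JSON. A minimal sketch, assuming both files are JSON arrays of per-site records keyed by a "site_id" field (the key name and record shape are assumptions, not the project's confirmed schema):

```python
import json

def load_records(path, key="site_id"):
    """Index a JSON array of per-site records by site."""
    with open(path) as f:
        return {rec[key]: rec for rec in json.load(f)}

def compare_to_ground_truth(crawl_path, truth_path, key="site_id"):
    """Count how many crawled records exactly match the ground truth."""
    crawl = load_records(crawl_path, key)
    truth = load_records(truth_path, key)
    matches = sum(1 for site, rec in truth.items() if crawl.get(site) == rec)
    return matches, len(truth)
```

Exact equality is the bluntest possible check; the Google Colab mentioned above presumably does a more field-aware comparison.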

@SebastianZimmeck SebastianZimmeck added the infrastructure An issue related to underlying compute or selecting technologies label Oct 8, 2023
@Jocelyn0830
Collaborator

I just tested the crawler on my own Mac (using the newly merged main branch). The instructions are easy to follow, and the local SQL database is easy to set up. I crawled the validation set Kate mentioned above. The crawler ran very well and didn't crash at all. I played around with it and found that the crawler was able to restart automatically even after I closed the window.

Result:
There are 50 sites in the validation set. The success rate is 100%, meaning the crawler successfully analyzed every site. Total time was about 26 minutes.

I will update Mac Mini results shortly.

@Jocelyn0830
Collaborator

I finished testing on the Mac Mini as well. Looking at the terminal output, the crawler crashed several times but was robust enough to restart each time.

Result:
In the end, the crawler was able to crawl all 50 sites. The success rate is 100%. Total time was about 29 minutes.

@katehausladen
Collaborator Author

Could you save the database data from both crawls as JSON? One way to do this is to go to http://localhost:8080/analysis, right-click, select Save As, and save it as a JSON file.

Then we can compare the crawl data to the ground truths. You can just put the files here.
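The same export could also be scripted instead of going through the browser's Save As dialog. A minimal sketch, assuming the local analysis server is running and that the /analysis endpoint returns the database contents as JSON (the output file name here is arbitrary):

```python
import json
import urllib.request

def save_analysis(url="http://localhost:8080/analysis",
                  out_path="analysis.json"):
    """Fetch the analysis endpoint and write its JSON payload to a file.

    Assumes the local server is running and that the endpoint
    responds with the database contents serialized as JSON.
    """
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    with open(out_path, "w") as f:
        json.dump(data, f, indent=2)
    return out_path
```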

@Jocelyn0830
Collaborator


@katehausladen should be completed now :)

@katehausladen
Collaborator Author

We will be using Sebastian's old computer.

3 participants