                Web Hosting Guide

Downloading The Internet
FirefoxRocks
post Jan 18 2010, 01:34 AM
Post #1


Super Member

Group: [HOSTED]
Posts: 898
Joined: 12-July 06
From: Ontario, Canada
Member No.: 14,464
myCENTs:7.83


I'm wondering if it is possible to save a copy of everything on the Internet. Ignoring ISP data transfer limitations (max GB per month), I have a download speed of approximately 4 Mbps.

The Internet isn't limited to web pages, though; it includes everything that is publicly accessible (not password-protected), which covers all music, videos, pictures, software, etc. Furthermore, I am not limiting it to HTTP servers: torrents, files on FTP servers and anything on peer-to-peer networks (Gnutella/LimeWire) count as well.

Saving everything in its current state (ignoring changes to the live version after it is saved), how long would this take? What if I upgrade my Internet connection, or theoretically use all the bandwidth of (for example) educational institutions (universities), ISPs (Shaw, Comcast, etc.) and large corporations (Microsoft, Google, etc.)?

I am not talking about indexing content, I mean saving the actual file. Every web page would be considered one file, and pictures, JavaScript, CSS, etc would be their own files.
levimage
post Jan 18 2010, 01:30 AM
Post #2


Premium Member

Group: [HOSTED]
Posts: 226
Joined: 1-October 07
From: United States
Member No.: 25,237
myCENTs:63.28


I don't know, but some institutions save on bandwidth costs by running a cache server or cache box. This device or host stores copies of Internet requests (URLs, DNS lookups, images, code, etc.) until they are purged.

The cache pays off when staff or computer labs visit the same web sites all the time. Pages served from the cache appear to load faster than the Internet connection would allow, and the cache saves bandwidth.

Companies that build caching solutions, along with the major backbone ISPs, would have a better chance of answering your question, or at least of making an educated guess.

Levimage :P

P.S. I myself have no clue. 10 Gb/s Ethernet is coming out soon. Probably good news for DVD torrent hippies :P
yordan
post Jan 18 2010, 02:10 PM
Post #3


Way Out Of Control - You need a life :)

Group: [MODERATOR]
Posts: 3,334
Joined: 16-August 05
Member No.: 7,896
myCENTs:80.50


I also think that would be a big problem: almost every computer is on the Internet, so in order to back up everything reachable through the Internet, you would need as much disk space as all the computers around the world combined. As you mention, Google's disk space would be peanuts by comparison, because you don't want to only index, you want the whole contents.
Бојан
post Jan 18 2010, 03:48 PM
Post #4


Member - Active Contributor

Group: Members
Posts: 89
Joined: 17-January 10
From: Macedonia
Member No.: 45,766
myCENTs:7.84


First of all, you can't find enough free space to do that, or a fast enough connection to download it. Second, why would you need a backup of the Internet? :) Greetings.
FirefoxRocks
post Jan 19 2010, 05:48 AM
Post #5


Super Member

Group: [HOSTED]
Posts: 898
Joined: 12-July 06
From: Ontario, Canada
Member No.: 14,464
myCENTs:7.83


The disk space available to me is truly infinite; the only questions here are process and bandwidth, as well as the CPU and disk speed needed to save and access everything.
A "backup" of the Internet is for experimental purposes only, and believe me, even if I managed to back up the entire Internet, including private network content, I would be using almost none of my disk space (an analogy would be, say, 1 electron out of the entire universe).
Quatrux
post Jan 19 2010, 04:39 PM
Post #6


the Q

Group: [HOSTED]
Posts: 1,432
Joined: 13-July 05
From: Lithuania, Vilnius
Member No.: 7,059
myCENTs:52.08


Well, there are a lot of data centers holding millions of terabytes of data, and they take a lot of energy, so I doubt you could compete with them :D

For example, a lot of content is dynamic, so it would be really hard to tell whether two files are actually different, and you could end up in an infinite loop unless you detect the differences and skip some of the content on the Internet.

Moreover, as far as I know Google doesn't offer its cached version of pages anymore? Or does it? Maybe that is also due to resources?

http://www.archive.org/ offers the Wayback Machine, which is quite cool, but it's usually slow and it doesn't offer all the content: it doesn't cache everything and it rarely caches images. But well, it's a non-profit project :)
tansqrx
post Jan 20 2010, 07:55 PM
Post #7


Super Member

Group: [HOSTED]
Posts: 711
Joined: 25-April 05
Member No.: 4,374
myCENTs:0.44


Interesting question. I am actually surprised that you, FirefoxRocks, asked it, as it sounds like a crazy idea I would expect from a newb. At any rate it did get me thinking, so I will propose an answer.

Assumptions
• You have an insane Internet backbone connection with guaranteed reliability and speed. I will assume that you have a 100 Mb/sec connection, which is usually only available to ISP-level organizations.
• You have an appropriately sized upstream connection to do all the requesting.
• You actually get the bandwidth you paid for. I personally have a “10 Mb/sec down and 1 Mb/sec up” consumer cable connection. I have never seen anything close to these numbers in real life. The closest I have seen is 2 Mb/sec down (downloading ISOs from Microsoft MSDN), and there is a hard limit of around 115 kb/sec up that I constantly hit. A more typical download speed is around 500 kb/sec for regular web browsing.
• We will ignore all network structure and latency issues and assume you have a direct connection to your target with no hops in between.
  o The nature of TCP/IP will limit you to around 80% of your bandwidth under ideal operation. When you have only two computers on a network (the ideal case) you will still never get 100% of the bandwidth because of TCP header overhead, IP header overhead, other traffic such as ARP requests, and IP timing issues. A typical network usually sees only 45-50% of its bandwidth because of collisions. A stressed-out network may only get 10%.
  o There is latency between your request and the data.
    ▪ Machine and router hardware delays. Usually microseconds.
    ▪ Every hop adds delay. Usually milliseconds.
    ▪ Server response time. Usually small compared to everything else but could become an issue. Ranges from milliseconds (typical) to minutes.
  o In total you should expect to take at least 50% off your promised bandwidth in an ideal case. This brings our 100 Mb/sec connection down to more like 50 Mb/sec; but as stated earlier, we are ignoring this.
  o Internet speed is based on more than your connection speed. The bandwidth of the server is also very important. You may have sufficient bandwidth, but if you request from a server that is slower than your connection, you are stuck with its speed. I find that a typical website will only transfer up to 50 kb/sec, so you will have to download from many different servers at the same time to fill your 100 Mb/sec pipe.
• You have enough computing power. At 100 Mb/sec you are starting to approach IDE hard drive transfer rates. You will also want to have several threads going at the same time to maximize bandwidth utilization: you want to download a different webpage while you are waiting on the request for another page. Better yet, you want to keep your bandwidth pipe full even if you hit a slow server or a timeout, which can be up to 2 minutes. I would guess that you would need 150-300 threads or requests going at the same time to meet this demand. A single computer likely will not be able to do this alone, so you would end up with at least 5-10 servers on your end to pull this off. This of course breaks the ideal case of no network congestion or collisions described earlier.
• You have enough storage space. A quick search shows that YouTube alone has around 7.7 petabytes of content (http://beerpla.net/2008/08/14/how-to-find-out-the-number-of-videos-on-youtube/). Newegg is showing 1 TB hard drives for around $90. With the needed hardware and controllers, you are looking at around $100/TB. At this rate you will need 7700 1 TB hard drives, which would cost you around $770,000. A related article on BackBlaze (http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/) shows you how to build your own 67 TB 4U rack server for $7,867 including drives and rack hardware. At the BackBlaze rate, 7.7 PB will cost $904,118, or almost 1 million dollars.
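
The storage arithmetic in the last bullet can be checked with a minimal Python sketch. The figures (7.7 PB, about $100/TB, 67 TB per $7,867 server) are the ones quoted in the post; the function names are hypothetical and only there for illustration, and decimal units (1 PB = 1000 TB) are assumed.

```python
import math

YOUTUBE_TB = 7700  # 7.7 PB from the post, assuming decimal terabytes


def bare_drive_cost(total_tb, dollars_per_tb=100):
    """Cost at a flat price per terabyte (the post's ~$100/TB estimate)."""
    return total_tb * dollars_per_tb


def pod_cost(total_tb, pod_tb=67, pod_price=7867, whole_pods=False):
    """Cost using 67 TB servers at $7,867 each (figures from the post).

    With whole_pods=False a fractional server count is allowed, which is
    how the post's figure works out; whole_pods=True rounds the count up.
    """
    pods = total_tb / pod_tb
    if whole_pods:
        pods = math.ceil(pods)
    return pods * pod_price


print(bare_drive_cost(YOUTUBE_TB))            # flat $/TB estimate
print(round(pod_cost(YOUTUBE_TB)))            # fractional servers
print(pod_cost(YOUTUBE_TB, whole_pods=True))  # whole servers only
```

Since nobody can buy a fraction of a server, the whole-server variant is probably the more realistic one; it needs 115 units and lands slightly above the fractional figure.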


Gotchas
• Connection speeds are measured in BITS and not BYTES. There are 8 bits to a byte so this means that you need to divide your connection speed by 8 right off the top. This will make our 100 Mbit/sec connection a 12.5 Mbyte/sec connection. With typical network delays, this would become 6.25 Mbyte/sec.
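
A small Python sketch of that conversion, for anyone who wants to plug in their own line speed. The function name and the `efficiency` parameter are made up for illustration; 0.5 simply mirrors the 50% allowance for network delays mentioned above.

```python
def usable_mbytes_per_sec(link_mbit_per_sec, efficiency=1.0):
    """Convert an advertised link speed in Mbit/s to MByte/s.

    efficiency is the fraction of the link you assume you really get
    (1.0 = ideal case, 0.5 = the post's allowance for network delays).
    """
    return link_mbit_per_sec / 8 * efficiency


print(usable_mbytes_per_sec(100))       # ideal case
print(usable_mbytes_per_sec(100, 0.5))  # with the 50% allowance
```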


Now let’s do some calculations (whips out trusty TI-89 calculator).

12.5 Mbyte/sec*60 seconds = 750 MB/min
750 MB/min* 60 mins = 45 GB/hour
45 GB/hour *24 hours = 1080 GB/day or ~1 TB/day (1.08e12)

With the YouTube example above of 7.7 petabytes (7.7e15 bytes)…

7.7e15 Bytes/1.08e12 Bytes/day=7129.63 days
7129.63 days/365 days/year = 19.5332 years

Just downloading the YouTube database with an insane Internet connection will take you almost 20 years and almost 1 million dollars just in hard drive storage.
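
The same calculation as a minimal Python sketch, so the link speed or the data size can be swapped out. It assumes decimal units (1 PB = 1e15 bytes, 1 Mbit = 1e6 bits) and a 365-day year, as in the figures above; `download_days` and its `efficiency` parameter are hypothetical names, not anything from a real tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def download_days(total_bytes, link_mbit_per_sec, efficiency=1.0):
    """Days needed to pull total_bytes over a link of the given speed.

    efficiency is the assumed fraction of the link that is actually
    usable (1.0 reproduces the ideal-case numbers in the post).
    """
    bytes_per_sec = link_mbit_per_sec * 1e6 / 8 * efficiency
    return total_bytes / (bytes_per_sec * SECONDS_PER_DAY)


youtube_bytes = 7.7e15  # 7.7 PB, the YouTube estimate used in the post

ideal_days = download_days(youtube_bytes, 100)
print(round(ideal_days, 2), "days, or", round(ideal_days / 365, 2), "years")

# Same link, but only half of it usable (the 50% allowance).
halved_days = download_days(youtube_bytes, 100, efficiency=0.5)
print(round(halved_days / 365, 2), "years at 50% efficiency")
```

Halving the usable bandwidth doubles the time, so with the 50% allowance the estimate grows from roughly 20 years to roughly 39.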

Hope this answers your question ;)
wutske
post Jan 21 2010, 10:35 PM
Post #8


Way Out Of Control - You need a life :)

Group: [HOSTED]
Posts: 1,338
Joined: 2-August 05
From: Kapellen (Antwerp, Belgium)
Member No.: 7,585
myCENTs:34.72


With your 4 Mbps download speed you'll never be able to keep up with all the data that is put on the Internet daily, especially on sites like YouTube. (Do note: YouTube uses 7.7 PB for storing all its data, but for every video they keep the original plus 360p, 480p, 720p and 1080p versions where possible. The most efficient approach would be to store only the highest-resolution version or the original video.)
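
To put a number on that, here is a rough Python sketch of the daily ceiling of a link. It assumes the line runs flat out 24 hours a day with no overhead, and decimal units (1 GB = 1e9 bytes); `gb_per_day` is a made-up helper name. The 7.7 PB figure is the YouTube estimate quoted in this thread.

```python
def gb_per_day(link_mbit_per_sec):
    """Most data (in decimal GB) a link can move in 24 hours, ideal case."""
    bytes_per_sec = link_mbit_per_sec * 1e6 / 8
    return bytes_per_sec * 86400 / 1e9


daily_gb = gb_per_day(4)
print(daily_gb, "GB per day at 4 Mbps")

# Years needed for 7.7 PB (7.7e6 GB) at that ceiling, using 365-day years.
years = 7.7e6 / daily_gb / 365
print(round(years, 1), "years for 7.7 PB")
```

Anything that grows by more than that daily ceiling can never be caught up with on such a line, no matter how long the download runs.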

The next problem is power consumption. A single disk doesn't use a lot of power, especially compared to a modern CPU. But 1000 disks easily use a few kW, generating a lot of heat that you have to get rid of.

You'll also need a huge room for all the racks, plus extra free air for cooling purposes (as a backup).

Imho, it's impossible to do.
levimage
post Jan 22 2010, 03:47 AM
Post #9


Premium Member

Group: [HOSTED]
Posts: 226
Joined: 1-October 07
From: United States
Member No.: 25,237
myCENTs:63.28


5 years from now it would not be impossible, but then again there will probably be 25 times as much data out there. Interesting, huh?
starscream
post Jan 22 2010, 06:16 AM
Post #10


Premium Member

Group: Members
Posts: 363
Joined: 21-September 09
From: Land of Shadows
Member No.: 42,995
myCENTs:87.17


I came across this blog post while surfing. I'm not sure if this is the one company that is up to downloading and archiving the Internet. Leaving out video and audio content, text-based sites are easy to crawl and archive, I guess, if these companies are doing it.

Check this blog post again. It's an interesting read, though I'm not sure whether that site still offers such a service. But I guess archiving Wikipedia is possible to some extent, so it's not a bad idea, and I think that is enough for some people.


Web Hosting Powered by ComputingHost.com.
HONESTY ROCKS! truth rules.
Creative Commons License