Nixers project: Bittorrent library - Community & Forums Related Discussions

Users browsing this thread: 1 Guest(s)
z3bra
Grey Hair Nixers
Thanks for having think about the implementation!

I didn't think a lot about it myself, but my first idea was to go with your latest proposal: write directly to the end file, and allocate the full size (which is known) first. This has the advantage of failling quickly in case you don't have enough space.

First of all, let's use the correct terms. A "torrent" is the final file. Each torrent is splitted into "pieces", each having a hash, and each piece is sent as multiple "block" (see the RFC for details).

Your idea is to put each piece into its own file, and then, concatenate them when you jave all the parts. The main advantage I see here is that you can easily know which part you've already downloaded, allowing easy stop and starts without a "cache file".
The main disadvantage is that you will have a lot of I/O upon concatenation, and that would be a single threaded process.
This would also use a shitons on inodes, but that's not the biggest issue.

The "sort as they come" idea seems slow, bloated and complex to implement for no practical reason, so I'll just ignore it.

By writing directly to the file, you save some time, as when you'll receive the last piece, your end file will be ready. This is also the simplest approach for me, as you will know the size of pieces, and their position, so writing to the file is just an fseek() away. The problem is that this is not atomic, and require a "cache file" to remember which pieces are retrieved (altough you should be able to reconstruct it by checking which pieces of the end file are zeroed).

For now, I find the later simpler, as it eliminates intermediary parts from the process. It comes down to
receive a piece, check its hash, write it to the file, and that's it. Your idea adds another step which is "put all the parts together", and another step means more code, and thus more attack surface for bugs :)


I think there is no urge to settle this now.
For now, the structs for pieces/blocks are not yet finished. We also cannot talk to trackers, or other clients. The advantage of this RFC is that (I think) all parts are explained in a logical order. For example, you cannot understand how a metainfo file is written if you don't what bencoding is.
You're talking here about the way we write pieces received to the torrent, which comes, I think at 6.3.10 in the RFC ("Piece" message).
The library is currently stuck at point 3: "pieces and blocks". We have many things to do before deciding how to write them!

Implementing the full protocol is a big task, and we need to stay focused on what needs to be done if we want to get it working quickly.

I chose to implement all the points in the RFC one by one, in the order they appear. It might not be the best way to handle this project, but at least I can easily know where I'm at, and what needs to be done.
If you think of a better way, we can still discuss it though, I wanted this project to be the one of the community, not just mine.
josuah
Long time nixers
(06-08-2017, 06:40 PM)z3bra Wrote: The main disadvantage is that you will have a lot of I/O upon concatenation, and that would be a single threaded process.
This would also use a shitons on inodes, but that's not the biggest issue.

Thanks, I did not think about this.

(06-08-2017, 06:40 PM)z3bra Wrote: Writing directly to the file [...] this is not atomic, and require a "cache file"

I mostly thought about concurrent seeding and leeching. I asked "is it possible to read and write a file at the same time" ? If not, one big torrent file open for writing one piece could not be used for reading another piece. But z3bra said on irc: "you can do that by mmap'ing the file".

So no real advantage to save the parts on different files, a big buffer file can provide everything needed.

The cache file will contain a list of completed pieces, this will bring atomicity back: a not fully written piece will not be in this cache file, and will be downloaded again if something went wrong.

(06-08-2017, 06:40 PM)z3bra Wrote: I think there is no urge to settle this now.

Let's begin with the beginning. I can get started with UTF-8, as discussed with z3bra:
http://bittorrent.org/beps/bep_0003.html tells us that:

Quote:All strings in a .torrent file that contains text must be UTF-8 encoded.

And the clients using the library may have broken locale, or simply use a different encoding. The wchar.h (C90) standard library can convert the currently running operating system encoding whatever it is to a encoding independent number (character position in Unicode).

(06-08-2017, 06:40 PM)z3bra Wrote: I chose to implement all the points in the RFC one by one, in the order they appear. It might not be the best way to handle this project, but at least I can easily know where I'm at, and what needs to be done.

We will need to notify each other our progress and self-assignments.
z3bra
Grey Hair Nixers
Just a quick bump to thank xkco for his participation. It's not much, but it shows the interest you have in the project, and the will to participate. Be it a welcoming start for anyone willing to participate in this project!
josuah
Long time nixers
xkco: you rock!



Venam started to make structs to hold the metainfo. That is a good idea I think, as it will help us work on multiple parts of the code independently and merge everything easily.

Here is the proposition: if we need to have an array of complex items, use a struct. Otherwise just use variables.

We can have multiple torrents per process and do not know how many of them.
I think we need a 'torrent' struct (torrent[i]->data).

Some torrents can have multiple files, struct as well (files[i]->path).

Code:
struct file {
    size_t       length;
    char        *path;
}

Code:
struct torrent {
    char        *announce_list[];   /* array of urls             */
    char        *name;
    size_t       length;
    size_t       pieces_length;
    size_t       pieces_count;      /* for convenience           */
    char        *pieces_sha1;
    char         torrent_sha1[20];  /* to compute ourself        */
    struct file *files[];           /* is NULL for single-file   */
    char        *data;              /* pointer to mmap-ed memory */
    char        *downloaded;        /* bitmask for the parts     */
}

If there is only announce on the metainfo file, the 'announce_list' would only
hold the first (prefered) url of tracker:

Code:
torrent->announce_list = { "http://tracker-1.tld", NULL }

I do not see the need to separate metainfo from the actual data.

We could do a 'part' struct as well, to avoid implementing a bitmask, and for convenience.
z3bra
Grey Hair Nixers
The bitmask will have to be implemeted imo, as the RFC says it should be sent to peers. Here is what I had in mind:

Code:
struct torrent {
    char *announce[];
    char *files[];
    uint8_t bits[];
    struct piece {
        size_t len;
        uint8_t sha1[20];
        uint8_t data[];
    } pieces[];
};

This is, IMO the bare minimum that needs to be implemented. We could then add on top of that things like comments, creation date, ... (all optionnal fields actually).
josuah
Long time nixers
That looks good to me. IMHO, the fewer data, the more stateless it will be ;) and the easier it will be to manage its state.

I did not know you could define nested structs. That is pretty neat.

One detail, though: how do we now at which byte to cut the big data blob to extract the files if we do not know their length?

Code:
struct torrent {
    char *announce[];
    uint8_t bits[];
    struct file {
        char *name;
        size_t len;
    } files[];
    struct piece {
        size_t len;
        uint8_t sha1[20];
        uint8_t data[];
    } pieces[];
};
z3bra
Grey Hair Nixers
Good catch! I forgot to add a field for the file size. I'd use "path" rather than "name" though, as in the RFC, "name" refers to the filename without the path, while we'd store both in this case.
josuah
Long time nixers
(12-08-2017, 08:42 AM)z3bra Wrote: I'd use "path" rather than "name" though.

That is fine.

I'm ok with adding it to the library then. :)
nas
Members
Hi Guys, I'd like to try contributing to this project but two things:
1) This would be my first time doing some "real" thing in C. (I use Java for work)
2) How do we test it if it works?
josuah
Long time nixers
(24-08-2017, 11:53 AM)nas Wrote: Hi Guys, I'd like to try contributing to this project but two things:
Quote:Nice! :) Welcome to the project. Make sure to join #unix @ irc.nixers.net if you have questions or ask here.

[quote="nas" pid="18868" dateline="1503586386"]
1) This would be my first time doing some "real" thing in C. (I use Java for work)

I started C by contributing to dvtm. I sent was awful patches, and martanne helped me to correct them (they were not merged into dvtm, though).

[quote="nas" pid="18868"]
2) How do we test it if it works?

We can setup our own tracker with existing clients: We need 2 things for a BitTorrent system: clients, a tracker (not really the server).
  • The clients do all the heavy lifting: they pass the file directly with peer to peer (as you know already), with a "peer wire protocol" (section 6).
  • The server to help the clients find each other (maintaining a peer list) and this go with HTTP GET requests.

The ideal is to test with many different trackers and clients, but if we have only one at first it is fine.

Open Tracker looks good to me, I will try to serve an instance at http://josuah.net:6969...
josuah
Long time nixers
Here are requests sent by bittorrent trackers:

rtorrent: https://p.iotek.org/fe2
Code:
GET /?info_hash=OG%15%00%E3%8E%88%BDd%03%AE%3A%E4%CBz%9Al%BDsv&peer_id=-lt0D60-%3C%FE%A06%94-%CC%9Bi%DC%0A%8C&key=6cde84e1&compact=1&port=6915&uploaded=0&downloaded=0&left=3825205248&event=started HTTP/1.1
Host: 0.0.0.0:1234
User-Agent: rtorrent/0.9.6/0.13.6
Accept: */*
Accept-Encoding: deflate, gzip

transmission: https://p.iotek.org/bd6
Code:
GET /?info_hash=OG%15%00%e3%8e%88%bdd%03%ae%3a%e4%cbz%9al%bdsv&peer_id=-TR2920-iesno8z0fsbc&port=51413&uploaded=0&downloaded=0&left=3825205248&numwant=80&key=52541b14&compact=1&supportcrypto=1&event=started&ipv6=2001%3A4b98%3Abeef%3Aa%3Aca5b%3A76ff%3Afe48%3A2a0d HTTP/1.1
Host: 0.0.0.0:1234
User-Agent: Transmission/2.92
Accept: */*
Accept-Encoding: gzip;q=1.0, deflate, identity

There is one empty line at the end of each request, and lines are ended with '\r\n' as wanted by HTTP.

[EDIT] it seems that z3bra already implemented this part :]
nas
Members
Thanks for the info @josuah. I don't have anything to do at work but our network blocks many addresses and ports. I'll just read more for now.


I can't even connect to the IRC servers. gahd.
z3bra
Grey Hair Nixers
The GET requests sent to trackers is indeed working since commit 913fcb4. It should now support any kind of message, and it returns the next interval to wait before the next heartbeat request. There is currently no way to retrieve other parameters like the number of seeders/leechers though.

I'm currently working on PWP messages (handshake seems working now)
z3bra
Grey Hair Nixers
I'm currently facing some decisions, about how to use the library. These decisions will define how the PWP functions will be used. Here is a pseudo code of how I imagine the interface. Feel free to discuss:

Code:
int interval = 0, last = 0;
uint8_t msg[MESSAGE_MAX];
off_t offset = 0;

interval = thpsend(torrent, "started");
last = time(NULL);

/* everything is wrapped within an infinite loop
for (;;) {
    if (time(NULL) - interval > last)
        interval = thpsend(torrent, NULL);

    foreach(torrent->peers as peer) {
        if (!peer->connected) {
            pwpsend(torrent, peer, PWP_HANDSHAKE);
            pwpsend(torrent, peer, PWP_BITFIELD);
            pwpsend(torrent, peer, PWP_INTERESTED);
            pwpsend(torrent, peer, PWP_UNCHOKED);
        } else {
            if (!peer->ischoked)
                pwpsend(torrent, peer, PWP_REQUEST); /* Request a new random block */
        }

        switch (pwprecv(torrent, peer, &msg)) {
        case PWP_REQUEST:
            pwpsendblock(torrent, peer, msg);
            break;
        case PWP_PIECE:
             offset = pwprecvblock(torrent, peer, msg);
             if (sha1(torrent->pieces[offset]) == torrent->sha1[offset]) {
                 endgame(torrent, offset); /* sends PWP_CANCEL message to all peers */
                 interval = thpsend(torrent, torrent, "completed");
            }
            break;
        }
    }
}

Does that make any sense? Would you change anything?
josuah
Long time nixers
This raises a few questions to me: what if the client's or the peers network is slow or a peer does not respond?

If pwpsend calls are blocking, the client will hang and wait for the response of every client before continuing with the next request, while it could send them all and wait for all answers, while still proceeding to the downloads (which would ideally never stop).

But the primitives looks good, it may be possible to use them to build asynchronous clients. I'm not yet very experienced with fork() and shared memory, but this is how I picture it based off what you propose:

Code:
/* setup shared memory: for the count of parallel requests, blocks requested... */
for (;;) {
    foreach(torrent->peers as peer) {
        if (count < MAX_REQUESTS && !fork()) {
            count++;
            if (!peer->connected) {
                pwpsend(torrent, peer, PWP_HANDSHAKE);
                pwpsend(torrent, peer, PWP_BITFIELD);
                pwpsend(torrent, peer, PWP_INTERESTED);
                pwpsend(torrent, peer, PWP_UNCHOKED);
            } else {
                if (!peer->ischoked)
                    pwpsend(torrent, peer, PWP_REQUEST); /* Request a new random block */
            }
            count--;
    }
}

And the same style for the other calls.

Another solution would be to have "pwpsend()" and friends to be asynchronous thmself, but if PWP_HANDSHAKE, PWP_BITFIELD, PWP_INTERESTED, PWP_UNCHOKED needs to be sent in this order, you would need shared with a queue for each peer listing all operations to send in turn, with a limit for not making the queue too long (requesting a piece 38 586 times to a peer not responding and still stuck at PWP_HANDSHAKE, ...).

In either case, this looks a really sane and simple interface, without complex structs to setup, neither initlibgbt() so far.

I am not good yet at networking or async programming, so there may be better/more standard ways to achieve it.
z3bra
Grey Hair Nixers
pwpsend() call are not not blocking. They only send a message and don't expect any response. pwprecv() and friends are though. It would indeed be a better idea to make them non blocking using poll()/select() (that would be up to the user implementation then).

About the network request queue, I though about it as well. Some recommendation advise no more than 10 request queued to avoid complex overhead. Not sure how to implement this properly though.

I'm definitely not for adding fork()/pthread_create() calls IN the library. This should be the user's choice to do so. But we could indeed design the library around it, using pwppreparereq() + pwpsendreq(), and let the user decide when he wants to send the messages.
josuah
Long time nixers
Good point: keeping more out of the library. Then the library and client complexity do not to stack together. No "do the stuff" function that manages everything. :)

The queue can also be managed by the client. As an example, you might want to use the lib to try to overflow another implementation to test the request load it can handle.

Are the request also non-blocking if the client is not connected to internet anymore (and unable to receive packet at all)?
z3bra
Grey Hair Nixers
Some update on this project!

Libgbt can now download, and upload torrents! It only handle it when there is a single peer though, we need a way to prevent querying the same piece/block to multiple peers.
I didn't test multi file torrents in this configuration, but it should technically work :D

The lib also doesn't bind on a port to accept incoming connections. I will only connect to peers, but won't accept new connections (not useful for seeding)
z3bra
Grey Hair Nixers
We reached a milestone here! libgbt can up/download torrents, with multiple peers. It can accept new connections as well (when seeding only for now).

Basic functionality being there, let's fuck it up and rewrite some bits! Here is what I'm not happy with, or what I want to add/improve:
  • out request management (ring buffer?)
  • peer list management
  • keep track of requests
  • end game algorithm
  • in/out net throttling
  • rework grizzly API

I'm not sure how the API should be used. I'd preffer something as simple as possible, like the following:

Code:
struct torrent to;
grizzly_load(&to, "file.torrent");

/* start download */
while(!grizzly_finished(&to))
    grizzly_leech(&to);

/* torrent is complete */
for (;;)
    grizzly_seed(&to);

If anyone has an idea or input, I'll welcome them!
josuah
Long time nixers
"I have long been of the opinion that a good software project must be written at least three times: once to understand the problem, once to understand the solution, and once to actually write it properly."

https://bearssl.org/goals.html
Laserswald
Members
(09-10-2017, 02:36 PM)z3bra Wrote: We reached a milestone here! libgbt can up/download torrents, with multiple peers. It can accept new connections as well (when seeding only for now).
Basic functionality being there, let's fuck it up and rewrite some bits! Here is what I'm not happy with, or what I want to add/improve:

z3bra,

Do you have a test suite set up? It would make things (like fucking up the API to make it easier to use) a little bit safer.

I can help with writing those tests if you like; when it comes to small C projects I would usually go for just some executable that links to the library and calls assert() a bunch.
z3bra
Grey Hair Nixers
It had tests at first, for internal functions (bencoding related mostly). Now that only the API is accessible I don't see the point for tests. What would the test do? Download a file, and checksum it? you'd depend on access to the network here. Another idea would be to use a local tracker, spawned for the tests. Not really helpful either...

I wanted to have tests at first, but the simplest way to test is to do it manually at this point.

But I must admit I suck at writing tests, so if you have an idea, please speak up! :)
josuah
Long time nixers
Thanks again for opening your thoughts on this project to the community.

(06-08-2017, 06:40 PM)z3bra Wrote: This would also use a shitons on inodes, but that's not the biggest issue.

Code:
$ doas find / -type f -exec du -k {} + | sort -n | awk '
    {
        current++;
        expected += $1 / 8 + 1;  # with 8 kB per block
    }

    END {
        print("current: " current);
        print("expected: " expected);
        print("ratio: " expected / current);
    }
'
current: 332455
expected: 8928090
ratio: 26.855

About 27 times more inodes in my case.

Code:
$ df -kih
Filesystem     Size    Used   Avail Capacity iused   ifree  %iused  Mounted on
/dev/sd0k      298G   60.2G    223G    21%  160449 19580221    1%   /home

8928090 / 19580221 = 46%
currently used space = 21%

With a size of 8kB per block, with my current setup, I would run out of inodes before I run out of space. You are right. :-)
z3bra
Grey Hair Nixers
hello there!

I'm bringing back to life this old thread. Due to the lack of interest in the original project (and seeing the mess I created...), I decided to rewrite this project from scratch, by using another approach: http://git.z3bra.org/libeech/log.html

It now uses poll() instead of select(), and make use of callbacks functions to receive messages.
The bencoding part has also been fully reimplement to be heapless (so not a single malloc() call!)

As I write this, the lib can connect to a peer, request pieces and write them to a cache file.
To put it simply, I can download a file via BitTorrent!

I'm now reaching some important steps in the implementation, regarding management of cache files, choosing to store things in memory or on the disk, and more importantly, how to communicate with multiple peers at once!

I'm looking for help in networking programming in C, or any idea regarding torrent download/upload algorithms.

Who's interested?