Message Boards

GROUPS:

[LiVE] Live Coding Sessions from Andreas Lauschke

Posted 9 months ago
4948 Views | 31 Replies | 32 Total Likes
There are size and capacity restrictions on this community site that I've run into (1 GB file size, and a maximum of 5 attached files), so I've moved everything to my own server. Please bookmark

http://andreaslauschke.net/wri-twitch.html

31 Replies

Looking forward to the 2nd live-coding session

Thank you, I enjoyed your session and am looking forward to the next one.

I only just tripped across your first presentation. Great, great stuff. Operators, associations and datasets have been awkward material for me, as I am "hardwired" to lists and traditional functional programming. Please continue at the level and pace you have set in the first presentation.

Thank you very much for your constructive feedback. At this point there is no planned end date for my live coding sessions, and I expect many more sessions, as the data scientist's progression

  • data sourcing / handling / filtering / aggregation
  • application (optimization, statistics, AI / NN / ML / DL, ...)
  • pure math <--> applied math

is a nearly endless paradise for the serious analyst / data scientist who can harness the appropriate tools to make inroads. And there is no other software system as highly integrated as the M system, so this is where I will demo the applicability of the M system to tackling real-world problems with concise and intuitive code.

After a few more sessions I'll prioritize the content based on audience feedback. There are many topics relevant to the professional data scientist, so I have to start balancing general appeal with audience requests:

parallelism in computation, compilation for speed-up, combining the two; CUDA, databases, AI, crypto, dynamic interactivity, JLink, the units framework, persistence, web, cloud, ... -- I won't run out of content soon. And after a while I think it will get more math-y: advanced regression, calculus of variations, control theory, region-based computing, differential geometry, ODEs, PDEs, ... all of which should be part of the professional data scientist's arsenal -- at least their very basics. We can't get too detailed, to keep it sufficiently general to be of interest to everyone.

To me, this is a very encouraging and exciting trajectory of presentations. I use, awkwardly, Mathematica as my go-to platform for data munging and analysis. Unfortunately, I do not have the adroitness of the experts presenting for Mathematica, so I have to study their code carefully to understand what is being done well enough to master the techniques myself. At my age, I have to fight against the rigidity of my past ways of doing things in order to grok the more useful methods. I thank you greatly for taking the time to think through how to present the Wolfram Language in a meaningful, comprehensible, and sequential way for an individual primarily interested in data acquisition and analysis. (A 71-yr-old hobbyist programmer.)

Thank you very much for your comment. Yes, to live means to learn and to improve. When we don't learn, we deprive ourselves of opportunities to grow. Virtual Greetings!

Posted 8 months ago

Hi Andreas,

Looking forward to seeing future live coding sessions by you. Could you please also attach PSD001.nb? It is missing from the current list of attachments.

Thanks, Rohit

Sorry, I must have accidentally deleted it. It was there, I know that for sure.

Posted 8 months ago

Hi Andreas,

You have used XETRA data in your notebook "AssocDataset002.nb". Could you please give us the corresponding URL of the csv file at Deutsche Boerse?

By the way, I highly appreciate your live coding sessions, and I am already waiting for the notebook of part 3.

Best regards, Jürgen

Posted 8 months ago

It's in the .nb, look at PDS001.nb, that was the first week. Here they are:

in AWS data registry: https://registry.opendata.aws/deutsche-boerse-pds/

documentation: https://github.com/Deutsche-Boerse/dbg-pds

so for example "https://s3.eu-central-1.amazonaws.com/deutsche-boerse-eurex-pds/2019-04-18/2019-04-18_BINS_XEUR14.csv" to get the 14:00 MEZ file for Apr 18 for Eurex.

same for XETRA, use https://s3.eu-central-1.amazonaws.com/deutsche-boerse-xetra-pds/2019-04-18/2019-04-18_BINS_XETR14.csv
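Based on the file naming visible in the two URLs above, a small helper can assemble the S3 URL for any venue/date/hour and import it directly. This is a sketch: the `pdsUrl` helper name is mine, and the naming convention is assumed to hold for all dates.

```wolfram
(* Sketch: build a Deutsche Boerse PDS URL for a venue ("xetra" or "eurex"),
   date, and hour, then import the CSV as a Dataset with a header row. *)
pdsUrl[venue_String, date_DateObject, hour_Integer] :=
  StringJoin[
    "https://s3.eu-central-1.amazonaws.com/deutsche-boerse-", venue, "-pds/",
    DateString[date, {"Year", "-", "Month", "-", "Day"}], "/",
    DateString[date, {"Year", "-", "Month", "-", "Day"}],
    "_BINS_", If[venue === "xetra", "XETR", "XEUR"],
    IntegerString[hour, 10, 2], ".csv"]

(* reproduces the XETRA URL above *)
pdsUrl["xetra", DateObject[{2019, 4, 18}], 14]

data = Import[pdsUrl["xetra", DateObject[{2019, 4, 18}], 14],
   "Dataset", "HeaderLines" -> 1];
```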

Uploading the file for part 3 now. I tend to wait until the YouTube video is ready, which usually takes a few days; I didn't see it live until yesterday evening.

Bye for now, Andreas

Posted 7 months ago

Hi Andreas,

Thank you for presenting Part 4. Could you please attach the associated notebook?

Posted 7 months ago

Dear Andreas,

Many thanks for your live coding sessions. Often I am not able to join them live, but I visit this post and the videos on a regular basis.

Are you planning a live coding session about how to manage large datasets (size near or exceeding the RAM of your computer)?

I am thinking of data streaming, processing large datasets, saving large datasets (from smaller chunks) in different formats for data exchange with systems other than Mathematica, etc.

I am looking forward to learning from your next live coding session.

Kind Regards,

Dave

I can present about this, but I try to have my sessions largely driven by audience requests. So far you are the only one to ask about this, and I have reasons not to present on it too soon (see point 1 below). However, I have some general comments:

  1. There is a piece of technology in the works for out-of-core processing. A very senior WRI programmer is working on it, and it's not finished yet. I'd rather wait until it is complete and then showcase that piece of beauty, instead of presenting something that would be even better once that future piece of built-in technology is usable. At this point, I think we should simply wait. This guy never writes bad code. Just wait. In principle, you can do out-of-core processing on your own already; https://www.wolfram.com/language/11/neural-networks/out-of-core-image-classification.html?product=language is an example of image classification. But that is specific, not generic, and I prefer generic over specific. (Ability is more valuable than knowledge, one of my philosophies.)
  2. I'm ardently supporting the philosophy that data that isn't needed by the kernel shouldn't be in the kernel in the first place. Think of it as: "kernel memory is precious" (it does actually consume extra memory). Don't ever handle data that you don't need. With that said, I'm a firm believer in pre-processing / filtering / extracting the salient data (opposite perspective: pre-bunking data you won't need) before loading into the kernel. It's no different from a database retrieval we're all familiar with: you submit a query to receive only what you want! I posted a reply on m.se about this some 6.5 years ago: https://mathematica.stackexchange.com/questions/16048/how-do-you-deal-with-very-large-datasets-in-mathematica/16060#16060. I strongly recommend that people use Linux tools to pre-extract the salient data. Depending on your data situation, and depending on your ability to use some smart pre-processing that can significantly reduce the data ingestion to be performed by the kernel, try to shoot for 90% or 95% or higher for pre-bunkable data. Oftentimes the remainder will fit into the memory space available to the kernel just fine. If you're on Win, look at cygwin or MobaXterm, two wonderful Linux tool systems available there (note: MSFT announced they'll support some Linux distros in the future, no date announced as of yet, and I wouldn't take the first few versions, as MSFT likes to botch things up, but eventually I venture the guess that this will be of good quality). Also, Win PowerShell may be a decent vehicle for data pre-processing available on Win right now. I believe it is, but how would I know, given that I only use Fedora? So that is an avenue I recommend walking, regardless of item 1. I think you should always do item 2, and in the future we'll have item 1 on top of it. But still, do item 2 anyway. Always pre-extract the salient data on the command line.
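Item 2 can also be driven from inside a notebook: the kernel can delegate the pre-filtering to a command-line tool and import only the result. A minimal sketch, where the file name `big.csv` and the awk condition (keep rows whose third column equals "XETR") are hypothetical placeholders:

```wolfram
(* Sketch: let awk do the heavy filtering outside the kernel, then
   import only the surviving rows. The filter condition and file name
   are made up for illustration. *)
filtered = RunProcess[
   {"awk", "-F,", "$3 == \"XETR\" {print}", "big.csv"},
   "StandardOutput"];

ds = ImportString[filtered, "Dataset"];
```

The kernel never sees the discarded rows, which is exactly the "receive only what you want" database mindset described above.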
Posted 5 months ago

Hello Andreas,

Thank you for your seminars; a good balance between theory and practice. Regarding outliers, are you familiar with the false nearest neighbours notion (Kennel, Abarbanel)? I think it is relevant, and I wonder how the Wolfram Language would accelerate computations in this area.

Best regards,

Marek Wojcik

Yes, I think theory and practice go hand in hand. Development of theory should have a purpose, and conversely, theory greatly enhances application. Remember: nothing is more practical than a good theory (Kurt Lewin, 1951). Regarding the false nearest neighbors concept from Kennel and Abarbanel, I can NOT see the nexus to outlier detection. It is based on higher-dimensional embeddings, and my simple intro was dealing with 1-dim and 2-dim data. I think I can include it when I talk about nearest neighbor concepts as preparation for various other fields of application (AI / NNs, optimization, clustering, regression, graph theory, and so forth). I think it is more suitable for understanding the nearest neighbor concept (and derived methods) in general than specifically for outlier detection.

Hi Andreas,

Would it be possible for you to post DatasetQueryWebScraping005.nb for download? You mention it as the notebook for the June 18 tutorial on Query and web scraping, but there is no link.

Thanks, Stephen Sheppard

I had uploaded it earlier. It's absolutely strange. Someone commented a while ago that a file I had uploaded was missing. I just uploaded 005 again, but now 006 is missing. I guess you can only upload six files, or there is some MB limit. I'll discuss this with tech support; it seems there is some upload limit. Thanks for bringing this to my attention.

Thanks! Must be a bug, but I had already downloaded 006 and so now I can work through the material you covered in 005. Appreciate your webinars, which are really helpful.

When is the 7th session going to be?

Probably Wed the 4th.

The 8th session will be Fri 20th, 5pm ET. Topics: a closer look at RarerProbability, and some fundamentals of continuous distributions, applied to RarerProbability.
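For readers who have not used RarerProbability before, a tiny sketch of the idea: it reports how likely a learned distribution is to generate a sample rarer (lower PDF) than a given example. The data and the two probe points below are made up for illustration:

```wolfram
(* Sketch: learn a distribution from 1-dim data and probe how "rare"
   two example values are under it. *)
SeedRandom[1];
sample = RandomVariate[NormalDistribution[0, 1], 500];
ld = LearnDistribution[sample];

RarerProbability[ld, 0.1]  (* typical value near the mode: should be close to 1 *)
RarerProbability[ld, 4.]   (* far in the tail: should be close to 0 *)
```

Small rarer-probabilities are the usual signal for outliers, which is how the function ties into the session topic.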

Posted 1 month ago

I found these lectures very informative.

On the 2016 election results, I started with a simple check of the data from your website. (Used as a dataset but displayed as normal form for insertion in this post.)

Query[Total, {"votes_dem", "votes_gop"}]@results

<|"votes_dem" -> 6.35711*10^7, "votes_gop" -> 6.39798*10^7|>

Interesting, but Trump did not win the popular vote.

A simple query

You are correct that the data is incorrect; in particular, it is incomplete. I had mentioned that, specifically that the Clinton votes for CA are incomplete, in a previous session in the homework section at the end. I scraped the data from someone else on github, knowing full well it is incomplete. My suspicion is that for this dataset, they stopped counting once they knew that the state could be called, which for CA was very early, as Hillary took CA by a landslide. That means the clearer the vote was for one of the candidates, the more incomplete that state's total is, because it was fair to stop counting early once the state was called. The original data source also had a field "percent reporting", and those were numbers like 97%, which means not all votes are in. If you look closely, the Trump votes are also "undercounted".

I plan to revisit this topic with the county-level data on politico.com; all their counties have percent reporting: 100%. And their totals match the numbers on the wikipedia page about the 2016 election. But I have deferred that for now, as I plan to combine it with another example about web-scraping (from the politico.com website). I want to combine these two things. So that has to wait a little longer, as my next sessions will be about large datasets (I'm stopping short of calling it "big data") and parallelism. My next session will probably be next week. But, trust me, I'll revisit this. When I scraped this data from github, I didn't know yet of the politico data. Thanks for your comment!
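If the scraped dataset carried the "percent reporting" field mentioned above, the undercounting could be made visible with a query in the same style as the one earlier in the thread. This is purely hypothetical: the field name `percent_reporting` is assumed, and the github data may not include it at all:

```wolfram
(* Hypothetical sketch: restrict the totals to fully reported counties,
   assuming a per-county "percent_reporting" field exists. *)
complete = Query[Select[#["percent_reporting"] == 100 &]]@results;

Query[Total, {"votes_dem", "votes_gop"}]@complete
```

Comparing this against the unrestricted totals would show directly how much of each candidate's count comes from counties that stopped reporting early.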

Posted 1 month ago

Thank you.

Your entire series has been an excellent presentation of Mathematica possibilities.

my pleasure!

Posted 15 days ago

I have found your presentations very helpful, and I am looking forward to more sessions!

Are you planning to cover training neural networks with large data sets? I have been experimenting with NN training, but I am limited to smaller data sets because I do not have a local GPU. I understand it is possible to use WL script/engine with a cloud computing service (e.g. AWS, etc.) for net training. Is it also possible to bring the trained network into the local machine for further analysis? I was wondering if you could cover this in a session.

Several points: I do plan to show NN programming in my data science track, but I don't know yet to what extent. Also reinforcement learning (with and without NNs -- most people don't know this, but reinforcement learning has many extremely useful applications that don't involve NNs at all). My next sessions in the data science track will be about parallelism features, followed directly by CUDA programming, which in my opinion is the logical extension of parallel kernels and takes parallelism to the next level. That will also lead me to LibraryLink (because running the most powerful CUDA applications from M naturally brings you to LibraryLink), and then I'll have sessions about the other link products (J/Link, NETLink, RLink, etc.). That means NNs are not on my near-term plan to begin with (I consider parallelism and the link products "infrastructure", and I want to cover infra first).

Next, I'll start a parallel track about financial options theory after the next data science session (probably on the 23rd). These will alternate roughly weekly between a data science session and a financial options theory session. I was discussing a few tracks with the PR group: financial options, differential equations, combinatorial optimization, NNs / ML (or perhaps more generally AI, which would include reinforcement learning). I decided to go with financial options first, because I can bring that to a close in about half a year. The others would take me a year or more (as I don't like to skimp on important things and like to dig deep). Also, WRI has plans to have some of their employees develop such programs for the interactive classes; I probably shouldn't disclose any details about that (extent, length, people, covered content) -- just letting you know that this is or will be in the works. I want to see that first; one of these programs won't start until the fall.
And as my financial options sessions with PR will take me until some time in the summer, I'll make my decision which track to do after my finopt track ends, no sooner than this summer. It may or may not be NNs (or more comprehensively AI); I'll decide that then together with the PR group. With that said, NNs / ML is indispensable for the modern-day data scientist, so I'll have to cover it, but not until several sessions about parallelism, CUDA, and the link products are covered first, and not in such detail as I'd use if I were to conduct a NN / ML / AI track starting this summer. I can't make my data science track only about AI; better to make that a separate track.

That said, I strongly recommend getting a local NVidia GPU as well. Simple cards don't cost much; you can get relatively new RTX cards (2060s, 2070s) for less than a kilobuck. Also the 1080/1070/1060 cards get cheaper, as these Pascal generation cards did not see the breakthrough through crypto mining many GPU miners were hoping for (NVidia has stockpiles of surplus cards; the Volta generation that followed was aimed at data centers). I normally wouldn't recommend Pascal cards anymore, long superseded by the Turing generation, but they are now VERY cheap, a few hundred bucks, and that's good enough for a start -- not good enough for real power computing. If you only have a schlepptop without an NVidia card, you have run into a snag. I've seen people use external GPUs with some external PCI dock, but I don't know if that's really powerful, and I don't like an external cable-and-power-supply mess. In short, if you only have a schlepptop, get a real box. You get more bang for the buck, or from the opposite perspective: you pay less for the same bang. Yes, you can retrieve a trained network file, computed in the cloud, and then use it on your local machine for inference (sometimes called prediction). But even then you probably want a local GPU.
I'll give you another reason: the GPUs in AWS are getting beefier and beefier all the time. I believe they're even phasing out their older ones, like the K80s; they're just not as powerful as the modern cards, and they're the old Kepler generation, compute capability 3.7. That's pre-pyramids by today's standards. But for the newer generations that means you pay significantly more per hour; some GPU instances are 10 - 20 bucks per hour. If you do that a lot, it will cost you more than getting a new workstation with a decent CUDA card. Also, if you "fill" these big cards with heavy jobs, then the resulting networks will probably be so big and complex that you would want a local GPU for inference/prediction anyway. Doing that without a local GPU will not be powerful; you can do that only with small/simple networks. But those you can create on your own, no need to pay for the beefy GPU instances you find on AWS (you pay for time, not problem size, so a massive job costs the same as a toy problem on the same instance type). I hope that explains my view on this from both sides.
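The cloud-train / local-inference workflow mentioned above boils down to exporting and re-importing the net in the .wlnet format. A minimal sketch, where `net`, `trainingData`, and the file name are placeholders:

```wolfram
(* Sketch: train on a cloud GPU instance, save the net, then load it
   locally for inference. "net" and "trainingData" are placeholders. *)

(* on the cloud machine (e.g. via wolframscript on a GPU instance): *)
trained = NetTrain[net, trainingData, TargetDevice -> "GPU"];
Export["trained.wlnet", trained];

(* on the local machine, after copying trained.wlnet over: *)
local = Import["trained.wlnet"];
local[input]  (* inference; runs on CPU, fine for small/simple nets *)
```

For small networks, CPU inference locally is perfectly workable; the local-GPU recommendation above applies once the nets get big.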

Posted 14 days ago

Thanks for the hardware recommendations - I will look for a good GPU solution. Also, the reinforcement learning sessions sound very interesting.

nothing in AI is boring. NOTHING! Prove me wrong :)

"Artificial Intelligence" (AI) is just a marketing term, designed and used to extract money from the US military (initially) and from venture capitalists (now). In other words, "opium for the masses" (with money).

After a "hot period", a severe "cold period" follows. That pattern -- the "pendulum swing" -- has been observed before. (A lot in general, and twice just within "AI".)

Boring...
